The architectural and organizational/process advantages of containerization (e.g., via Docker) are well known. However, when constructing images, especially those that serve as the base for other images, adding functionality via package installation is a double-edged sword. On one hand, we want our images to be as useful as possible for the purposes they are built for; on the other, as images are downloaded, moved around our networks, and live in our production environments, we pay a real price in speed and cost for bloated image sizes. The obvious onus on image creators is to make images as small as practically possible without sacrificing efficacy and extensibility. This blog post shows how we shrunk our images with a pretty simple trick.
The great impetus towards smaller images manifests in a few places:
- OS distros, such as Phusion (a minimal, Docker-friendly Ubuntu), BusyBox (intended for embedded systems), and Alpine. These provide operating systems that are minimally functional yet can be easily extended.
- Programming environments, such as microcontainers from Iron.io.
- Shrink-wrapping tools, such as Skinnywhale, `docker export`, and strip-docker-image, which work with existing image layers/containers and try to compress them by finding redundancies and commonalities.
When creating Wise.io's open version of the Python data science base image, I found that the choice of OS distro does not affect the final image size much, since so many dependencies are required to get a fully functional data science environment up and running. Before turning to post-creation shrink wrapping, I looked for ways to shrink the resulting image in the Dockerfile itself.
The essential point is that since each RUN creates a new layer, you need to condense logical installation and teardown steps into a single RUN. You can do this easily with chained double ampersands (&&) in the shell. If you tear down/clean up in a separate RUN, your final image will still carry the bloat from the previous layers. We needed three major installation/cleanup steps in our Dockerfile:
1. System level dependencies
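A sketch of what this step can look like (the package list and locale are illustrative, not our exact Dockerfile):

```dockerfile
# Install, configure, and clean up in ONE layer so the apt cache never persists
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential wget bzip2 locales && \
    locale-gen en_US.UTF-8 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```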
Here you'll notice that in addition to updating the OS, installing new packages, and setting locales, we also purge the cache of apt installation files.
2. The (Python) conda distro and data-science-friendly Python packages like jupyter notebook, pandas, numpy, matplotlib, sklearn, scikit-image, nltk, psycopg2:
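A sketch of this step, installing Miniconda and then the packages in a single RUN (the installer URL and install prefix are assumptions, and `conda clean` flags may vary by conda version):

```dockerfile
# Download installer, install conda + packages, then purge tarballs and
# package caches -- all in the same layer
RUN wget -q https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && \
    bash /tmp/miniconda.sh -b -p /opt/conda && \
    rm /tmp/miniconda.sh && \
    /opt/conda/bin/conda install -y \
        jupyter pandas numpy matplotlib scikit-learn scikit-image nltk psycopg2 && \
    /opt/conda/bin/conda clean --tarballs --index-cache --packages --yes
```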
3. All the Python packages we want that are not in the standard conda channel (e.g., gensim, plotly) but are available via pip:
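For example (a sketch, assuming pip caches downloads under /root/.cache/pip as recent pip versions do; `pip install --no-cache-dir` is an alternative that avoids the cache entirely):

```dockerfile
# Install the pip-only packages and drop the pip download cache in one layer
RUN pip install gensim plotly && \
    rm -rf /root/.cache/pip
```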
Here we make sure to remove the cache directory after we're done.
The "trick" is really just two components:
- Put all logically connected installations (e.g. from one package manager) into their own RUN, to produce fewer layers.
- Figure out what the teardown/cleanup commands are for those installations/package managers and tack them onto the end of the RUN (e.g., `conda clean`, `rm`, ...).
All told, we saved about 46% space (475 MB) just by setting up and tearing down in the same RUN.
If you're a Pythonista/data scientist and would like to give our base image a shot just:
```shell
docker pull wiseio/datascience-docker
```
And get started with Jupyter notebooks and more.
We'd love to hear from you if you've got any other tricks to shrink down this image.
Thanks to Paul Baines and Henrik Brink for comments on earlier drafts.