WARNING: Because of the Markdown rendering of this blog, tab characters have been replaced with 4 spaces in code blocks. For this reason, the makefile code will not work when copied directly from the post. Instead, you must first replace all 4-space indents with a tab character.
I use GNU Make to automate my data processing pipelines. I've written a tutorial 1 for novices on the basics of using Make for reproducible analysis and I think that everyone who writes more than one script, or runs more than one shell command to process their data can benefit from automating that process. I'm not alone.
However, the investment required to learn Make and to convert an
entire project can seem daunting to many time-strapped researchers.
Even if you aren't
living the dream—rebuilding
a paper from raw data with a single invocation of
make paper
—I still
think you can benefit from adding a simple Makefile
to your project root.
When done right, scripting the tedious parts of your job can
save you time in the long run2.
But the time savings aren't the only reason to do it.
For me, a bigger advantage is that I get to save my mental energy for
more interesting problems3.
Make goes a step further and lets me forget about everything but my
real objective.
With a make [target]
invocation I don't even need to remember the name of the
script.
The default makefile
TL;DR: All of the code in this post is available as a gist.
Here's what a minimal makefile might look like:
define PROJECT_HELP_MSG
Usage:
make help show this message
make clean remove intermediate files (see CLEANUP)
make ${VENV} make a virtualenv in the base directory (see VENV)
make python-reqs install python packages in requirements.pip
make git-config set local git configuration
make setup git init; make python-reqs git-config
make start-jupyter launch a jupyter server from the local virtualenv
endef
export PROJECT_HELP_MSG
help:
echo $$PROJECT_HELP_MSG | less
.git:
git init
git-config: | .git
git config --local filter.dropoutput_jupyter.clean \
drop_jupyter_output.sh
git config --local filter.dropoutput_jupyter.smudge cat
git config --local core.page 'less -x4'
git config --local \
diff.daff-csv.command "daff.py diff --git"
git config --local \
merge.daff-csv.name "daff.py tabular merge"
git config --local \
merge.daff-csv.driver "daff.py merge --output %A %O %A %B"
VENV = .venv
export VIRTUAL_ENV := $(abspath ${VENV})
export PATH := ${VIRTUAL_ENV}/bin:${PATH}
${VENV}:
python3 -m venv $@
python-reqs: requirements.pip | ${VENV}
pip install --upgrade -r requirements.pip
setup: ${VENV} python-reqs git-config | .git
start-jupyter:
jupyter notebook --config=jupyter_notebook_config.py
CLEANUP = *.pyc
clean:
rm -rf ${CLEANUP}
.PHONY: help git-config start-jupter python-reqs setup clean
If you want to start using it right away, download the gist, which includes a couple of other necessary files. As long as you aren't saving it over another makefile, it won't mess anything up.
But let's break it down so you can see how it's made and why it's awesome.
From the top!
A help message for your project
define PROJECT_HELP_MSG
Usage:
make help show this message
make clean remove intermediate files (see CLEANUP)
make git-config set local git configuration
make ${VENV} make a virtualenv in the base directory (see VENV)
make python-reqs install python packages in requirements.pip
make setup git init; make python-reqs git-config
make start-jupyter launch a jupyter server from the local virtualenv
endef
export PROJECT_HELP_MSG
help:
echo $$PROJECT_HELP_MSG
The top of our makefile is a help message.
Running the traditional invocation make help
will call that recipe and we'll
see an abridged list of the available recipes printed to our terminal.
Since help
is the very first recipe in the makefile, it will also be the
default recipe;
typing make
alone prints the help message.
As you start adding additional recipes, fill out this usage message. That way you'll have both documentation about the analysis targets, and also a handy cheatsheet.
Edit (2016-06-15): On Reddit, /r/guepier suggests using a nifty trick to auto-generate these help messages, keeping documentation and recipes together in your makefile.
Streamline git setup
.git:
git init
Every project should be version controlled.
I prefer git, but the makefile can probably be adapted for Mercurial,
Subversion, darcs, etc.
This recipe is so simple as to appear useless (since make .git
is no easier
to type than git init
) but we use the directory .git/
as an
order-only prerequisite
for the next recipe:
git-config: | .git
git config --local filter.dropoutput_jupyter.clean \
drop_jupyter_output.sh
git config --local filter.dropoutput_jupyter.smudge cat
git config --local core.page 'less -x4'
git config --local \
diff.daff-csv.command "daff.py diff --git"
git config --local \
merge.daff-csv.name "daff.py tabular merge"
git config --local \
merge.daff-csv.driver "daff.py merge --output %A %O %A %B"
Git configuration is just annoying enough that I often put it off for a new project. With this recipe I don't have to!
There are three parts to the configuration above; customize it for how you use git.
Drop Jupyter Notebook output
git config --local filter.dropoutput_jupyter.clean \
./drop_jupyter_output.sh
git config --local filter.dropoutput_jupyter.smudge cat
I set up a clean/smudge filter for my Jupyter notebooks.
Outputs of analysis should generally not be version controlled,
and this includes those outputs that are inlined in a Jupyter notebook.
Now, when you git add
and git diff
notebooks, the output from cells will
be automatically ignored.
Thankfully, using this filter won't change the contents of the .ipynb
file
itself, just the contents of the diff.
This does mean, however, that when you git checkout
an old version of your
notebook you'll have to re-execute all of the cells to get the results.
Two other files are needed for this configuration to have any effect.
First, .gitattributes
which is a tab-separated file mapping filename patterns
to special git configuration.
The first line in that file should be the following.
*.ipynb filter=dropoutput_jupyter
(That's a tab after *.ipynb
.)
The second file is the filter drop_jupyter_output.sh
,
which needs to be executable.
#!/usr/bin/env bash
# run `chmod +x drop_jupyter_output.sh` to make it executable.
file=$(mktemp)
cat <&0 >$file
jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True \
$file --stdout 2>/dev/null
Display tabs as four spaces
I also configure less
to show four spaces for tabs.
This makes git diff
-ing my makefile much easier on the eyes.
git config --local core.page 'less -x4'
Smart diff
s for tabular data
Since git considers changes on a per-line basis, looking at
diff
s of comma-delimited and tab-delimited files can get obnoxious.
The program daff
fixes this problem.
We'll configure git to use daff
for all tabular files.
git config --local \
diff.daff-csv.command "daff.py diff --git"
git config --local \
merge.daff-csv.name "daff.py tabular merge"
git config --local \
merge.daff-csv.driver "daff.py merge --output %A %O %A %B"
Just like the output filter for Jupyter notebooks, we need to associate
this configuration with CSVs and TSVs in our .gitattributes
file by adding
the following two lines.
*.[tc]sv diff=daff-csv
*.[tc]sv merge=daff-csv
Automatic python virtual environments
There are plenty of reasons to sandbox your python environments. If you're like me and keep a separate virtual environment for every project, you'll appreciate these recipes to automate creating them and updating packages.
If you don't use python/pip, these recipes can be swapped out for other sandboxing systems.
VENV = .venv
export VIRTUAL_ENV := $(abspath ${VENV})
export PATH := ${VIRTUAL_ENV}/bin:${PATH}
${VENV}:
python3 -m venv $@
python-reqs: requirements.pip | ${VENV}
pip install --upgrade -r requirements.pip
In the top block, we first set a variable VENV
to be the location of our
virtual environment.
We then set VIRTUAL_ENV
and prepend its bin/
to our PATH
.
By exporting these variables, all recipes run from this makefile will
use python packages and executables from the virtual environment.
We don't have to remember to source .venv/bin/activate
first!
(Edit (2016-06-22): Based on my own testing, it would appear that this
approach to virtual environments in recipes does not work with the default
GNU Make version installed on OS X.
It will, however, work with Homebrew's version which is
installed as gmake
instead of make
.
It is unclear to me why the behavior is different.)
The next block is the recipe to initialize the virtual environment. If you're not using Python 3 for your project you will have to edit this one.
And finally, a recipe to install and update all of the packages listed in
requirements.pip
.
If you want to make a change to your python requirements, add it to
requirements.pip
and re-run make python-reqs
.
You can bootstrap other software installations similarly. And, if you discipline yourself to make all changes to your execution environment in this way, you'll have a permanently up-to-date record of your system requirements.
Single-command project setup
setup: ${VENV} python-reqs git-config | .git
With this meta-target a simple make setup
will have our new project
configured and ready to go.
This is particularly useful if you work on multiple machines:
git clone git@github.com:username/project.git
cd project
make setup
is all it takes to get up and running.
Launch your tools without the hassle
I use Jupyter Notebooks a lot.
With this recipe (and the PATH
we export above) I don't have to remember
to activate my virtual environment or invoke specific configuration files
when I launch a server.
start-jupyter:
jupyter notebook --config=jupyter_notebook_config.py
Put whatever you'd like into the config file. I like to keep my notebooks in a subdirectory, so my invocation is a little different:
jupyter notebook --config=ipynb/jupyter_notebook_config.py \
--notebook-dir=ipynb/
And my configuration automatically changes the working directory to the project root when launching a new notebook.
Customize! The same general idea works for any other software you can start from the shell. No need to remember any of the obnoxious command-line flags.
Quick cleanup
CLEANUP = *.pyc
clean:
rm -rf ${CLEANUP}
A ubiquitous target for Make is clean
to tidy up the repository.
With this makefile, run make clean
to remove all the *.pyc
files.
Customize the CLEANUP
variable with filenames and globs you find yourself
rm
-ing repeatedly.
For me, this includes a bunch of *.log
and *.logfile
files.
Fork this code!
That's all I've got for a default makefile. And even this one is more complicated than it has to be; any one component from it can make your life easier when practicing reproducible research.
The whole point is to hide as much of the humdrum stuff as you can so you get to focus on what counts. I've found this makefile saves me both time and, more importantly, mental energy.
The Makefile
, .gitattributes
, requirements.pip
and
drop_jupyter_output.sh
described
here can all be downloaded from this gist4.
Next time you're starting a project, download them to the project directory,
run make setup
, and let me know what you think!
-
My tutorial is designed to fill a three hour Software Carpentry lesson. There are a number of much shorter primers to get you started (e.g. #1, #2, #3). ↩
-
Randall Munroe does not agree. Relevant XKCDs: #1, #2, and #3 ↩
-
John Cook makes this argument on his blog. ↩
-
Even better, you could write a recipe to download those files on
make setup
! ↩
Comments