
Tutorial: Reproducible bioinformatics pipelines using GNU Make

WARNING: Because of the Markdown rendering of this blog, tab characters have been replaced with 4 spaces in code blocks. For this reason, the makefile code will not work when copied directly from the post. Instead, you must first replace all 4-space indents with a tab character.

For most projects with moderate to intense data analysis you should consider using Make. Some day I'll write a post telling you why, but for now check out this post by Zachary M. Jones1. If you're already convinced, or just want to see what it's all about, read on.

This post is the clone of a tutorial that I wrote for Titus Brown's week-long Bioinformatics Workshop at UC Davis's Bodega Marine Laboratory in February, 2016. For now, the live tutorial lives in a Github repository, although I eventually want to merge all of the good parts into the Software Carpentry Make lesson (repository).

I'm posting this tutorial because I think it's a good introduction to the analysis pipeline approach I have been slowly adopting over the last several years. This approach is even more deeply enshrined in a project template that I have been developing. You can think of this tutorial as a gentle introduction to the motivation for that template.

The goals of this tutorial are three-fold:

  • Teach GNU Make basics,
  • Demonstrate the use of general best practices (version control, READMEs, etc.), and
  • Describe my preferred way to organize projects.

While much of the material is original, it was heavily inspired by the Software Carpentry Make lesson, which is licensed CC-BY 4.0, and the example project is almost identical.

The tutorial is divided into five sections:

  1. Setup
  2. Motivation
  3. Makefile basics
  4. Make features
  5. Best practices for Make-based projects

Setup

(Estimated time: 15 Minutes)

This tutorial was designed for UNIX systems and has been tested on Amazon EC2 using the Ubuntu Server 14.04 LTS image and a "m3.medium" instance. If you would like to use Windows, Git-Bash (packaged with Git for Windows) is probably your best bet, although it has not been tested on that platform.

For this lesson we will be using an already prepared set of files.

curl https://codeload.github.com/bsmith89/make-example/tar.gz/v1.0-snap \
    > make-example-1.0-snap.tgz
tar -xzf make-example-1.0-snap.tgz
cd make-example-1.0-snap

Let's take a look at the files we will be working with:

sudo apt-get update
sudo apt-get install tree
tree

The tree command produces a handy tree-diagram of the directory.

.
├── books
│   ├── abyss.txt
│   ├── isles.txt
│   ├── last.txt
│   ├── LICENSE_TEXTS.md
│   └── sierra.txt
├── LICENSE.md
├── matplotlibrc
├── plotcount.py
├── README.md
└── wordcount.py

1 directory, 10 files

Be sure that you also have Python 3, Git, and GNU Make.

sudo apt-get install python3 git make

Configure git.

git config --global user.name "Your Name"
git config --global user.email you@example.com

Install matplotlib.

sudo apt-get install python3-matplotlib

Motivation

(Estimated time: 30 minutes)

The most frequently-occurring word occurs approximately twice as often as the second most frequent word. This is Zipf's Law.

Let's imagine that instead of computational biology we're interested in testing Zipf's law in some of our favorite books. We've compiled our raw data, the books we want to analyze (check out head books/isles.txt) and have prepared several Python scripts that together make up our analysis pipeline.

Before we begin, add a README to your project describing what we intend to do.

nano README.md
# Describe what you're going to do. (e.g. "Test Zipf's Law")

The first step is to count the frequency of each word in the book.

./wordcount.py books/isles.txt isles.dat

(The leading './' is required so that Bash knows we're executing a file in the current directory rather than a command in our path.)

Let's take a quick peek at the result.

head -5 isles.dat

shows us the top 5 lines in the output file:

the 3822    6.7371760973
of  2460    4.33632998414
and 1723    3.03719372466
to  1479    2.60708619778
a   1308    2.30565838181

Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.

We can do the same thing for a different book:

./wordcount.py books/abyss.txt abyss.dat
head -5 abyss.dat

Finally, let's visualize the results.

./plotcount.py isles.dat ascii

The ascii argument has been added so that we get a text-based bar-plot printed to the screen.

The script is also able to render a graphical bar-plot using matplotlib and save the figure to a given file.

./plotcount.py isles.dat isles.png

Together these scripts implement a common workflow:

  1. Read a data file.
  2. Perform an analysis on this data file.
  3. Write the analysis results to a new file.
  4. Plot a graph of the analysis results.
  5. Save the graph as an image, so we can put it in a paper.

Writing a "master" script

Running this pipeline for one book is pretty easy using the command line. But once the number of files and the number of steps in the pipeline grows, this turns into a lot of work. Plus, no one wants to sit and wait for each command to finish, even if it only takes 30 seconds, just so they can type the next one.

The most common solution to the tedium of data processing is to write a master script that runs the whole pipeline from start to finish.

We can make a new file, run_pipeline.sh that contains:

#!/usr/bin/env bash
# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss.

./wordcount.py books/isles.txt isles.dat
./wordcount.py books/abyss.txt abyss.dat

./plotcount.py isles.dat isles.png
./plotcount.py abyss.dat abyss.png

# Now archive the results in a tarball so we can share them with a colleague.
rm -rf zipf_results
mkdir zipf_results
mv isles.dat abyss.dat isles.png abyss.png zipf_results/
tar -czf zipf_results.tgz zipf_results
rm -r zipf_results

This master script solves several problems of computational reproducibility:

  1. It explicitly documents our pipeline, making communication with colleagues (and our future selves) more efficient.
  2. It allows us to type a single command, bash run_pipeline.sh, to reproduce the full analysis.
  3. It prevents us from repeating typos or mistakes. You might not get it right the first time, but once you fix something it'll (probably) stay that way.

To continue with the Good Ideas, let's put everything under version control.

git init
git add README.md
git commit -m "Starting a new project."
git add wordcount.py plotcount.py matplotlibrc
git commit -m "Write scripts to test Zipf's law."
git add run_pipeline.sh
git commit -m "Write a master script to run the pipeline."

Notice that we didn't version control any of the products of our analysis. We'll talk more about this at the end of the tutorial.

A master script is a good start, but it has a few shortcomings.

Let's imagine that we adjusted the width of the bars in our plot produced by plotcount.py.

nano plotcount.py
# In the definition of plot_word_counts replace:
#    width = 1.0
# with:
#    width = 0.8
git add plotcount.py
git commit -m "Fix the bar width."

Now we want to recreate our figures. We could just bash run_pipeline.sh again. That would work, but it could also be a big pain if counting words takes more than a few seconds. The word counting routine hasn't changed; we shouldn't need to recreate those files.

Alternatively, we could manually rerun the plotting for each word-count file and recreate the tarball.

for file in *.dat; do
    ./plotcount.py $file ${file/.dat/.png}
done

rm -rf zipf_results
mkdir zipf_results
mv isles.dat abyss.dat isles.png abyss.png zipf_results/
tar -czf zipf_results.tgz zipf_results
rm -r zipf_results

But then we don't get many of the benefits of having a master script in the first place.

Another popular option is to comment out a subset of the lines in run_pipeline.sh:

#!/usr/bin/env bash
# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss.

# These lines are commented out because they don't need to be rerun.
#./wordcount.py books/isles.txt isles.dat
#./wordcount.py books/abyss.txt abyss.dat

./plotcount.py isles.dat isles.png
./plotcount.py abyss.dat abyss.png

# Now archive the results in a tarball so we can share them with a colleague.
rm -rf zipf_results
mkdir zipf_results
mv isles.dat abyss.dat isles.png abyss.png zipf_results/
tar -czf zipf_results.tgz zipf_results
rm -r zipf_results

Followed by bash run_pipeline.sh.

But this process, and subsequently undoing it, can be a hassle and a source of errors in complicated pipelines.

What we really want is an executable description of our pipeline that lets software do the tricky part for us: figuring out which steps need to be rerun. It would also be nice if this tool encouraged a modular analysis, so that we could reuse parts of our pipeline instead of rewriting them. As an added benefit, we'd like it all to play nicely with the other mainstays of reproducible research: version control, Unix-style tools, and a variety of scripting languages.

Makefile basics

(Estimated time: 45 minutes)

Make is a computer program originally designed to automate the compilation and installation of software. Make automates the process of building target files through a series of discrete steps. Despite its original purpose, this design makes it a great fit for bioinformatics pipelines, which often work by transforming data from one form to another (e.g. raw data → word counts → ??? → profit).

For this tutorial we will be using an implementation of Make called GNU Make, although others exist.

A simple Makefile

Let's get started writing a description of our analysis for Make.

Open up a file called Makefile in your editor of choice (e.g. nano Makefile) and add the following:

isles.dat: books/isles.txt
    ./wordcount.py books/isles.txt isles.dat

We have now written the simplest, non-trivial Makefile2. It is pretty reminiscent of one of the lines from our master script. It is a good bet that you can figure out what this Makefile does.

Be sure to notice a few syntactical items.

The part before the colon is called the target and the part after is our list of prerequisites (there is just one in this case). This first line is followed by an indented section called the recipe. The whole thing is together called a rule.

Notice that the indent is not multiple spaces, but is instead a single tab character. This is the first gotcha in makefiles. If the difference between spaces and a tab character isn't obvious in your editor of choice, try moving your cursor from one side of the tab to the other. It should jump four or more spaces. If your recipe is not indented with a tab character it is likely to not work.

Notice that this recipe is exactly the same as the analogous step in our master shell script. This is no coincidence; Make recipes are shell scripts. The first line (target: prerequisites) explicitly declares two details that were implicit in our pipeline script:

  1. We are generating a file called isles.dat
  2. Creating this file requires books/isles.txt

We'll think about our pipeline as a network of files that are dependent on one another. Right now our Makefile describes a pretty simple dependency graph.

books/isles.txt → isles.dat

where the "→" is pointing from requirements to targets.

Don't forget to commit:

git add Makefile
git commit -m "Start converting master script into a Makefile."

Running Make

Now that we have a (currently incomplete) description of our pipeline, let's use Make to execute it.

First, remove the previously generated files.

rm *.dat *.png
make isles.dat

You should see the following print to the terminal:

./wordcount.py books/isles.txt isles.dat

By default, Make prints the recipes that it executes3.

Let's see if we got what we expected.

head -5 isles.dat

The first 5 lines of that file should look exactly like before.

Rerunning Make

Let's try running Make the same way again.

make isles.dat

This time, instead of executing the same recipe, Make prints make: Nothing to be done for 'isles.dat'.

What's happening here?

When you ask Make to make isles.dat it first looks at the modification time of that target. Next it looks at the modification time for the target's prerequisites. If the target is newer than the prerequisites Make decides that the target is up-to-date and does not need to be remade.

Much has been said about the shortcomings of using modification times as the cue for remaking files. This can be another Make gotcha, so keep it in mind.

If you want to induce the original behavior, you just have to change the modification time of books/isles.txt so that it is newer than isles.dat.

touch books/isles.txt
make isles.dat

The original behavior is restored.

Sometimes you just want Make to tell you what it thinks about the current state of your files. make --dry-run isles.dat will print Make's execution plan, without actually carrying it out. The flag can be abbreviated as -n.

If you don't pass a target as an argument to make (i.e. just run make) it will assume that you want to build the first target in the Makefile.

More recipes

Now that Make knows how to build isles.dat, we can add a rule for plotting those results.

isles.png: isles.dat
    ./plotcount.py isles.dat isles.png

The dependency graph now looks like:

books/isles.txt → isles.dat → isles.png

Let's add a few more recipes to our Makefile.

abyss.dat: books/abyss.txt
    ./wordcount.py books/abyss.txt abyss.dat

zipf_results.tgz: isles.dat abyss.dat isles.png abyss.png
    rm -rf zipf_results/
    mkdir zipf_results/
    cp isles.dat abyss.dat isles.png abyss.png zipf_results/
    tar -czf zipf_results.tgz zipf_results/
    rm -r zipf_results/

And commit the changes.

git add Makefile
git commit -m "Add recipes for abyss counts, isles plotting, and the final archive."

Here the recipe for zipf_results.tgz involves running a series of shell commands. When building the archive, Make runs each line in turn, stopping if any of them returns an error.
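
(One detail worth keeping in mind: Make runs each line of a recipe in a separate shell, so state such as the working directory does not carry over from one line to the next. If two commands need to share state, join them on a single line with &&. A minimal sketch — the target name here is just for illustration:)

archive-example:
    cd zipf_results/ && tar -czf ../zipf_results.tgz .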

Question

Without doing it, what happens if you run make isles.png?

Challenge

What does the dependency graph look like for your Makefile?

Try it

What happens if you run make zipf_results.tgz right now?

Practice

Write a recipe for abyss.png.

Once you've written a recipe for abyss.png you should be able to run make zipf_results.tgz.

Let's delete all of our files and try it out.

rm abyss.* isles.*
make zipf_results.tgz

You should get something like the following output (the order may differ) printed to your terminal:

./wordcount.py books/abyss.txt abyss.dat
./wordcount.py books/isles.txt isles.dat
./plotcount.py abyss.dat abyss.png
./plotcount.py isles.dat isles.png
rm -rf zipf_results/
mkdir zipf_results/
cp isles.dat abyss.dat isles.png abyss.png zipf_results/
tar -czf zipf_results.tgz zipf_results/
rm -r zipf_results/

Since you asked for zipf_results.tgz Make looked first for that file. Not finding it, Make looked for its prerequisites. Since none of those existed it remade the ones it could, abyss.dat and isles.dat. Once those were finished it was able to make abyss.png and isles.png, before finally building zipf_results.tgz.

You may also have gotten an additional line in your output similar to the following.

rm abyss.dat isles.dat abyss.png isles.png

Because you only asked for zipf_results.tgz, Make thinks it's doing you a favor by deleting the intermediate files. As computational biologists we know never to trust our analyses until they've been tested, and intermediate files are a valuable audit trail. To prevent this default behavior, add the following to your Makefile.

.SECONDARY:

Now remove the outputs and rerun your pipeline.

rm zipf_results.tgz *.dat *.png
make zipf_results.tgz

.SECONDARY is one of a handful of special targets used to control Make's behavior.
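
(As an aside, another special target you may find useful is .DELETE_ON_ERROR, which tells Make to delete a target whose recipe failed part-way through, so that a truncated output file can't masquerade as a finished one:)

.DELETE_ON_ERROR: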

Try it

What happens if you touch abyss.dat and then make zipf_results.tgz?

git add Makefile
git commit -m "Finish translating pipeline script to a Makefile."
git status

Notice all the files that Git wants to be tracking? Like before, we're not going to version control any of the intermediate or final products of our pipeline. To reflect this fact add a .gitignore file:

*.dat
*.png
zipf_results.tgz
LICENSE.md

Then tell Git about the new file:

git add .gitignore
git commit -m "Have git ignore intermediate data files."
git status

Phony targets

Sometimes it's nice to have targets that don't refer to actual files.

all: isles.png abyss.png zipf_results.tgz

Even though this rule doesn't have a recipe, it does have prerequisites. Now, when you run make all, Make will do whatever it needs to do to bring all three of those targets up to date.

It is traditional for "all" to be the first rule in a makefile, since the first target is what gets built by default when no other target is passed as an argument.

Another traditional target is "clean". Add the following to your Makefile.

clean:
    rm --force *.dat *.png zipf_results.tgz

Running make clean will now remove all of the cruft.

Watch out, though!

Try it

What happens if you create a file named clean (i.e. touch clean) and then run make clean?

When you run make clean you get make: Nothing to be done for 'clean'. That's not because all those files have already been removed; Make isn't that smart. Instead, Make sees that there is already a file named "clean" and, since this file is newer than all of its prerequisites (there are none), decides there's nothing left to do.

To avoid this problem add the following to your Makefile.

.PHONY: all clean

This special target tells Make to assume that the targets "all", and "clean" are not real files; they're phony targets.

git add Makefile
git commit -m "Added 'all' and 'clean' recipes."
rm clean

Make features

(Estimated time: 45 minutes)

Right now our Makefile looks like this:

# Dummy targets
all: isles.png abyss.png zipf_results.tgz

clean:
    rm --force *.dat *.png zipf_results.tgz

.PHONY: all clean
.SECONDARY:

# Analysis and plotting
isles.dat: books/isles.txt
    ./wordcount.py books/isles.txt isles.dat

isles.png: isles.dat
    ./plotcount.py isles.dat isles.png

abyss.dat: books/abyss.txt
    ./wordcount.py books/abyss.txt abyss.dat

abyss.png: abyss.dat
    ./plotcount.py abyss.dat abyss.png

# Archive for sharing
zipf_results.tgz: isles.dat abyss.dat isles.png abyss.png
    rm -rf zipf_results/
    mkdir zipf_results/
    cp isles.dat abyss.dat isles.png abyss.png zipf_results/
    tar -czf zipf_results.tgz zipf_results/
    rm -r zipf_results/

Looks good, don't you think? Notice the added comments, starting with the "#" character just like in Python, R, shell, etc.

Using these recipes, a simple call to make builds all the same files that we were originally making either manually or using the master script, but with a few bonus features.

Now, if we change one of the inputs, we don't have to rebuild everything. Instead, Make knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build. It's no longer our job to track those dependencies. One fewer cognitive burden getting in the way of research progress!
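
(To see the incremental build in action, try touching a single input and rebuilding; only the files downstream of that input should be remade:)

touch books/isles.txt
make all
# Expect: only isles.dat, isles.png, and zipf_results.tgz are rebuilt;
# the abyss files are left alone.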

In addition, a makefile explicitly documents the inputs to and outputs from every step in the analysis. These are like informal "USAGE:" documentation for our scripts.

Parallel Make

And check this out!

make clean
make --jobs

Did you see it? The --jobs flag (just "-j" works too) tells Make to run recipes in parallel. Our dependency graph clearly shows that abyss.dat and isles.dat are mutually independent and can both be built at the same time. Likewise for abyss.png and isles.png. If you've got a bunch of independent branches in your analysis, this can greatly speed up your build process.
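
(You can also cap the number of recipes that run simultaneously, which is kinder to shared machines; for example, to allow at most two jobs at a time:)

make clean
make --jobs 2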

D.R.Y. (Don't Repeat Yourself)

In many programming languages, the bulk of the language features exist to let the programmer describe long-winded computational routines as short, expressive, beautiful code. Features in Python or R like user-defined variables and functions are useful in part because they mean we don't have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the D.R.Y. principle.

Make has a number of features designed to minimize repetitive code. Our current makefile doesn't yet take advantage of them, but it easily could.

Automatic variables

One overly repetitive part of our Makefile: targets and prerequisites appear in both the header and the recipe of each rule.

It turns out that

isles.dat: books/isles.txt
    ./wordcount.py books/isles.txt isles.dat

can be rewritten as

isles.dat: books/isles.txt
    ./wordcount.py $^ $@

Here we've replaced the prerequisite "books/isles.txt" in the recipe with "$^" and the target "isles.dat" with "$@". "$^" and "$@" are variables that refer to all of the prerequisites and the target of a rule, respectively. In Make, variables are referenced with a leading dollar sign. While we can also define our own variables, Make automatically defines a number of them, like the ones I've just shown you4.
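
One more automatic variable turns up frequently enough to mention here: "$<" expands to just the first prerequisite, which is handy when a recipe should not receive the whole prerequisite list. As a small sketch, the plotting rule could equally be written:

isles.png: isles.dat
    ./plotcount.py $< $@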

Therefore

zipf_results.tgz: isles.dat abyss.dat isles.png abyss.png
    rm -rf zipf_results/
    mkdir zipf_results/
    cp isles.dat abyss.dat isles.png abyss.png zipf_results/
    tar -czf zipf_results.tgz zipf_results/
    rm -r zipf_results/

can now be rewritten as

zipf_results.tgz: isles.dat abyss.dat isles.png abyss.png
    rm -rf zipf_results/
    mkdir zipf_results/
    cp $^ zipf_results/
    tar -czf $@ zipf_results/
    rm -r zipf_results/

That's a little less cluttered, and still perfectly understandable once you know what the variables mean.

Try it

make clean
make isles.dat

You should get the same output as last time. Internally, Make replaced "$@" with "isles.dat" and "$^" with "books/isles.txt" before running the recipe.

Practice

Go ahead and rewrite all of the rules in your Makefile to minimize repetition and take advantage of these automatic variables. Don't forget to commit your work.

Pattern rules

Another deviation from D.R.Y.: We have nearly identical recipes for abyss.dat and isles.dat.

It turns out we can replace both of those rules with just one rule, by telling Make about the relationships between filename patterns.

A "pattern rule" looks like this:

%.dat: books/%.txt
    ./wordcount.py $^ $@

Here we've replaced the book name with a percent sign, "%". The part that "%" matches is called the stem, and it can be any sequence of characters in the target. (Kind of like a "*" (glob) in a path name, but they are not the same.) Whatever the stem matches is then filled in to the prerequisites wherever there's a "%".

This rule can be interpreted as:

In order to build a file named [something].dat (the target), find a file named books/[that same something].txt (the prerequisite) and run ./wordcount.py [the prerequisite] [the target].

Notice how helpful the automatic variables are here. This recipe will work no matter what stem is being matched!
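
(One more automatic variable is worth knowing here: "$*" expands to whatever the stem matched, so the same rule could also be written like this sketch:)

%.dat: books/%.txt
    ./wordcount.py books/$*.txt $@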

We can replace both of the rules that matched this pattern (abyss.dat and isles.dat) with just one rule. Go ahead and do that in your Makefile.

Try it

After you've replaced the two rules with one pattern rule, try removing all of the products (make clean) and rerunning the pipeline.

Is anything different now that you're using the pattern rule?

If everything still works, commit your changes to Git.

Practice

Replace the recipes for abyss.png and isles.png with a single pattern rule.

Challenge

Add books/sierra.txt to your pipeline.

(i.e. make all should plot the word counts and add the plots to zipf_results.tgz)

Commit your changes to Git before we move on.

User defined variables

Not all variables in a makefile are of the automatic variety. Users can define their own, as well.

Add these lines at the top of your makefile:

ARCHIVED := isles.dat isles.png \
            abyss.dat abyss.png \
            sierra.dat sierra.png

Just like many other languages, in makefiles "\" is a line-continuation character. Think of this variable definition as a single line without the backslash.

The variable ARCHIVED is a list of the files that we want to include in our tarball. Now wherever we write ${ARCHIVED} it will be replaced with that list of files. The dollar sign, "$", and curly-braces, "{}" (parentheses work too), are needed when inserting the contents of a variable.

Notice the backslashes in the variable definition splitting the list over three lines, instead of one very long line. Also notice that we assigned to the variable with ":=". This is generally a Good Idea; assigning with a normal equals sign can result in non-intuitive behavior for reasons that we will not be talking about5. Finally, notice that the items in our list are separated by whitespace, not commas. Prerequisite lists were the same way; this is just how lists of things work in makefiles. If you included commas they would be considered parts of the filenames.
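
As an aside, GNU Make's built-in functions can even construct this list for you from the contents of books/. A sketch using the wildcard and patsubst functions (note that this would pull in every .txt file in books/, including books we haven't added to the pipeline yet):

BOOKS    := $(wildcard books/*.txt)
ARCHIVED := $(patsubst books/%.txt,%.dat,${BOOKS}) \
            $(patsubst books/%.txt,%.png,${BOOKS})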

Using this variable we can replace the prerequisites of zipf_results.tgz. That rule would now be:

zipf_results.tgz: ${ARCHIVED}
    rm -rf zipf_results/
    mkdir zipf_results/
    cp $^ zipf_results/
    tar -czf $@ zipf_results/
    rm -r zipf_results/

We can also use ${ARCHIVED} to simplify our cleanup rule.

clean:
    rm --force ${ARCHIVED} zipf_results.tgz

Try it

Try running clean and then all.

Does everything still work?

Best practices for Make-based projects

(Estimated time: 60 minutes)

A Makefile can be an important part of a reproducible research pipeline. Have you noticed how simple it is now to add/remove books from our analysis? Just add or remove those files from the definition of ARCHIVED or the prerequisites for the all target! With a master script approach, like run_pipeline.sh, adding an additional book required either more complicated or less transparent changes.

What's a prerequisite?

We've talked a lot about the power of Make for rebuilding research outputs when input data change. When doing novel data analysis, however, it's very common for our scripts to change at least as often as the data.

What happens when we edit our scripts instead of changing our data?

Try it

First, run make all so your analysis is up-to-date.

Let's change the default number of entries in the rank/frequency plot from 10 to 5.

(Hint: edit the function definition for plot_word_counts in plotcount.py to read limit=5.)

Now run make all again. What happened?

As it stands, we have to run make clean followed by make all to update our analysis with the new script. We're missing out on the benefits of incremental analysis when our scripts are changing too.

There must be a better way...and there is! Scripts should be prerequisites too.

Let's edit the pattern rule for %.png to include plotcount.py as a prerequisite.

%.png: plotcount.py %.dat
    ./$^ $@

The header makes sense, but that's a strange looking recipe: just two automatic variables.

This recipe works because "$^" is replaced with all of the prerequisites, in order. When building abyss.png, for instance, ./$^ $@ becomes ./plotcount.py abyss.dat abyss.png, which is exactly what we want.

(Remember that we need the leading './' so that Bash knows we're executing a file in the current directory and not a command in our path.)

Try it

What happens when you run the pipeline after modifying your script again?

(Changes to your script can be simulated with touch plotcount.py.)

Practice

Update your other rules to include the relevant scripts as prerequisites.

Commit your changes.

Directory structure

Take a look at all of the clutter in your project directory (run ls to list all of the files). For such a small project that's a lot of junk! Imagine how hard it would be to find your way around this analysis if it had more than three steps. Let's move some stuff around to make our project easier to navigate.

Store scripts in scripts/

First we'll stow away the scripts.

mkdir scripts/
mv plotcount.py wordcount.py scripts/

We also need to update our Makefile to reflect the change:

%.dat: wordcount.py books/%.txt
    ./$^ $@

%.png: plotcount.py %.dat
    ./$^ $@

becomes:

%.dat: scripts/wordcount.py books/%.txt
    $^ $@

%.png: scripts/plotcount.py %.dat
    $^ $@

That's a little more verbose, but it is now explicit that wordcount.py and plotcount.py are scripts.

Git should have no problem with the move once you tell it which files to be aware of.

git add wordcount.py plotcount.py
git add scripts/wordcount.py scripts/plotcount.py
git add Makefile
git commit -m "Move scripts into a subdirectory."

Great! From here on, when we add new scripts to our analysis they won't clutter up our project root.

"Hide" intermediate files in data/

Speaking of clutter, what are we gonna do about all of these intermediate files!? Put 'em in a subdirectory!

mkdir data/
mv *.dat data/

And then fix up your Makefile. Adjust the relevant lines to look like this.

# ...

ARCHIVED := data/isles.dat isles.png \
            data/abyss.dat abyss.png \
            data/sierra.dat sierra.png

# ...

data/%.dat: scripts/wordcount.py books/%.txt
    $^ $@

%.png: scripts/plotcount.py data/%.dat
    $^ $@

# ...

Thanks to our ARCHIVED variable, making these changes is pretty simple.

We have to make one more change if we don't want Git to bother us about untracked files. Update your .gitignore.

data/*.dat
*.png
zipf_results.tgz
LICENSE.md

Now commit your changes.

git add Makefile
git add .gitignore
git commit -m "Move intermediate data files into data/."

Simple!

Output finished products to fig/

Practice

Update your Makefile so that the plots and zipf_results.tgz are in a directory called fig/.

You can call this directory something else if you prefer, but fig/ seems short and descriptive.

Try it

Does your pipeline still execute the way you expect?

File naming

Use file extensions to indicate format

Up to this point, we've been working with three types of data files, each with its own file extension.

  • ".txt" files: the original book in plain-text
  • ".dat" files: word counts and percentages in a plain-text format
  • ".png" files: PNG formatted barplots

Using file extensions like these clearly indicates to anyone not familiar with your project what software to view each file with; you won't get much out of opening a PNG with a text editor. Whenever possible, use a widely used extension to make it easy for others to understand your data.

File extensions also give us a handle for describing the flow of data in our pipeline; pattern rules rely on this convention. Our makefile says that the raw book data feeds into word-count data, which in turn feeds into the barplots.

But the current naming scheme has one obvious ambiguity: ".dat" isn't particularly descriptive. Lots of file formats can be described as "data", including binary formats that would require specialized software to view. For tab-delimited, tabular data (data in rows and columns), ".tsv" is a more precise convention.

Updating our pipeline to use this extension is as simple as a find-and-replace of ".dat" with ".tsv" in our Makefile. If you're tired of mv-ing your files every time you change your pipeline, you can instead run make clean followed by make all to check that everything still works.
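
If you'd rather not do the find-and-replace by hand, one way to do it (using GNU sed, which is what ships with Ubuntu) looks like this:

make clean
sed -i 's/\.dat/.tsv/g' Makefile
make all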

You might want to update your "clean" recipe to remove all the junk like so:

clean:
    rm -f data/* fig/*

Be sure to commit all of your changes.

Infix processing hints

One of our goals in implementing best practices for our analysis pipeline is to make it easy to change it without rewriting everything. Let's add a preprocessing step to our analysis that puts everything in lowercase before counting words.

The program tr (short for "translate") is a Unix-style filter that swaps one set of characters for another. tr '[:upper:]' '[:lower:]' < [input file] > [output file] will read the mixedcase input file and write all lowercase to the output file.

We can add this to our pipeline. We know the recipe is going to look like this:

tr '[:upper:]' '[:lower:]' < $^ > $@

Challenge

Rewrite your Makefile to update the pipeline with the preprocessing step.

You probably decided to take the pattern books/%.txt as the prerequisite, but what did you opt to name the target?

data/%.txt is an option, but that means we have two files named [bookname].txt, one in books/ and one in data/. Probably not the easiest to differentiate.

A better option is to use a more descriptive filename.

data/%.lower.txt: books/%.txt
    tr '[:upper:]' '[:lower:]' < $^ > $@

By including an infix of .lower. in the filename it's easy to see that one file is a lowercase version of the mixedcase original. Now we can extend our pipeline with a variety of pre- and post-processing steps, give each of them a descriptive infix, and each filename becomes a self-documenting record of that file's origins.

For reasons which will become apparent in a minute, let's also make a dummy preprocessing step which just copies the books verbatim into our data/ directory.

data/%.txt: books/%.txt
    cp $^ $@

And, in the spirit of infixes, we'll rename data/%.tsv to be more descriptive.

data/%.counts.tsv: scripts/wordcount.py data/%.txt
    $^ $@

fig/%.counts.png: scripts/plotcount.py data/%.counts.tsv
    $^ $@

Our Makefile now documents explicitly that we require a tab-delimited table of word counts in order to generate the plot, whereas before any .tsv file would suggest a word-plot was possible.

Here's the full Makefile:

ARCHIVED := data/isles.lower.counts.tsv data/abyss.lower.counts.tsv \
            data/sierra.lower.counts.tsv fig/isles.lower.counts.png \
            fig/abyss.lower.counts.png fig/sierra.lower.counts.png

# Dummy targets
all: fig/isles.lower.counts.png fig/abyss.lower.counts.png \
        fig/sierra.lower.counts.png zipf_results.tgz

clean:
    rm --force data/* fig/*

.PHONY: all clean
.SECONDARY:

# Analysis and plotting
data/%.txt: books/%.txt
    cp $^ $@

data/%.lower.txt: data/%.txt
    tr '[:upper:]' '[:lower:]' < $^ > $@

data/%.counts.tsv: scripts/wordcount.py data/%.txt
    $^ $@

fig/%.counts.png: scripts/plotcount.py data/%.counts.tsv
    $^ $@

# Archive for sharing
zipf_results.tgz: ${ARCHIVED}
    rm -rf zipf_results/
    mkdir zipf_results/
    cp $^ zipf_results/
    tar -czf $@ zipf_results/
    rm -r zipf_results/

Our filenames are certainly more verbose now, but in exchange we get:

  1. self-documenting filenames
  2. more flexible development
  3. and something else, too...

make clean
make fig/abyss.lower.counts.png
make fig/abyss.counts.png

What happened there? We just built two different barplots, one for our analysis with the preprocessing step and one without. Both from the same Makefile. By liberally applying pattern rules and infix filenames we get something like a "filename language". We describe the analyses we want to run and then have Make figure out the details.

Practice

Update your drawing of the dependency graph.

Built-in Testing

It's a Good Idea to check your analysis against some form of ground truth. The simplest version of this is a well-defined dataset that you can reason about independent of your code. Let's make just such a dataset. Let's write a book!

Into a file called books/test.txt add something like this:

My Book
By Me

This is a book that I wrote.

The END

We don't need software to count all of the words in this book, and we can probably imagine exactly what a barplot of the count would look like. If the actual result doesn't look like we expected, then there's probably something wrong with our analysis. Testing your scripts with this tiny book is computationally cheap, too.

Let's try it out!

make fig/test.lower.counts.png
less data/test.lower.counts.tsv

Does your counts data match what you expected?

We should run this test for just about every change we make, to our scripts or to our Makefile. We're going to do that a lot so we'll make it as easy as possible.

test: fig/test.lower.counts.png

.PHONY: test clean all

You could even add the test phony target as the first thing in your Makefile. That way just calling make will run your tests.
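
For instance, the top of the Makefile might end up looking something like this sketch:

test: fig/test.lower.counts.png

all: fig/isles.lower.counts.png fig/abyss.lower.counts.png \
        fig/sierra.lower.counts.png zipf_results.tgz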

Practice

Add a cleanup target called testclean which is specific for the outputs of your test run.

Commit your changes, including books/test.txt.

git add Makefile
git add -f books/test.txt
git commit -m "Add pipeline testing recipe and book."

Review: version control

We have been following three guiding principles in our use of version control during this lesson.

  1. Use it (always).

    Version control is a Good Idea and should be used for any files which describe your pipeline. This includes notes/documentation/TODOs, scripts, and the Makefiles themselves.

  2. Don't version control raw or processed data which can be recreated.

    Raw data stays raw and data cleanup should be part of the pipeline. Because of this, backing up your data is imperative, but version control is not usually the best way to do so. Consider adding a recipe which downloads raw data using wget or curl (a sketch of such a rule appears after this list).

    One exception would be test or example data. These should be version controlled, as they are subject to change as testing is adapted to the evolving pipeline.

    In many cases metadata should be version controlled, since the format and composition of the metadata is intimately linked with the analysis pipeline itself.

  3. Aim to commit "atomic" changes to your pipeline.

    This means you should usually run make test before committing your changes so that regressions don't need to be fixed in subsequent commits. Co-dependent updates to metadata, documentation, and testing should be included in the same commit. In a perfect world, make all should work, and documentation should be up to date, regardless of which revision has been checked out. Excessive application of this principle is ill-advised, though.

    A more common problem is the behemoth commit, which bundles a large number of unrelated changes. In general, a single-sentence commit message should be able to summarize all of the changes in a commit.
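
As promised above, here is a sketch of the kind of download rule mentioned in point 2; the URL is purely hypothetical and would be replaced with wherever your raw data actually lives:

books/%.txt:
    curl --location https://example.com/books/$*.txt > $@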


  1. I swear I didn't know about that post when I titled my tutorial. Great minds think alike? 

  2. While several other filenames will work, it is a Good Idea to always call your Makefile Makefile.

  3. Notice that we didn't tell Make to use Makefile. When you run make, the program automatically looks in several places for your Makefile. 

  4. See https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html.

  5. Variables are complicated in Make. Read the extensive documentation about variable assignment. 
