This is the first of two posts about my program PyMake. I’ll post the link to Part II here when I’ve written it. While I still agree with many of the views expressed in this piece, I have changed my thinking on Makefiles.
I’ll write a new post about the topic when I take the time. I’ve written a tutorial on using Make for reproducible data analysis.
I am an aspiring but unskilled (not yet skilled?) computer geek. You can observe this for yourself by watching me fumble my way through vim configuration and multi-threading/processing in Python, among other things.
Rarely do I actually feel like my products are worth sharing with the wider world. The only reason I have a GitHub account is personal convenience and absolute confidence that no one else will ever look at it besides me. (Yes, I realize that I am invalidating the previous sentence with that glaring “Fork me on GitHub” ribbon in the top-right corner of this page. I’m putting myself out there! OKAY?!)
As an aspiring scientist, too, I’ve had plenty of opportunities to practice the relevant skill sets. A laboratory rotation with Titus Brown, and the resulting exposure to his reproducible research and Software Carpentry evangelizing, has certainly influenced the tools and techniques in my belt.
I try to use the matplotlib stack for my computational and visualization tasks. I am a relatively competent BASH-ist and I work hard to write my scripts so that they’ll make sense to me five years from now. I have even been known to do some of my data analysis in IPython notebooks.
A Pipeline is only sometimes a Makefile
Despite (or maybe because of) my obsession with writing simple, reproducible pipelines, one tool I have never come to terms with is GNU make. While it’s not quite mainstream for bioinformaticians and other computational folk, make promises to tie all those *NIX-style scripts together seamlessly, with built-in parallelization, selective re-running, and more, all under a declarative language syntax. I say ‘promises’ because, for me, it never did any of those things.
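The “selective re-running” part is worth spelling out. Roughly, make rebuilds a target when it is missing or older than any of its prerequisites. Here is a minimal Python sketch of that rule. The function name `needs_rebuild` and the file names are hypothetical, made up for illustration, and real make does considerably more (phony targets, chained rules, and so on).

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    Hypothetical helper for illustration only. A missing prerequisite
    raises an error here, which is roughly where make would complain
    that it has no rule to build it.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


# Demo with throwaway files. Timestamps are set explicitly so the
# result does not depend on filesystem clock resolution.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    out = os.path.join(tmp, "output.txt")

    open(src, "w").close()
    missing = needs_rebuild(out, [src])   # output does not exist yet

    open(out, "w").close()
    os.utime(src, (100, 100))             # input older than output
    os.utime(out, (200, 200))
    fresh = needs_rebuild(out, [src])

    os.utime(src, (300, 300))             # input modified after output
    stale = needs_rebuild(out, [src])

print(missing, fresh, stale)
```

Only the first and last cases would trigger a re-run; the middle one is up to date and gets skipped.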
Now, I don’t want to suggest that this ubiquitous piece of GNU software doesn’t work well. I recognize that it does much of what the average user needs, but for my particular pipeline it just wasn’t the right tool.
My problem was a seemingly simple one. I had a set of gene models (HMMs) and a set of FASTQ-formatted sequences from an Illumina sequencer. The goal was to search every sample for every gene using HMMER3 and to output the results (plus a respectable amount of pre- and post-processing). The problem is, make is designed for software compilation. Processing foo.o is easy. I, however, was asking make to generate the product of $n$ samples and $m$ models (complete aside: if you’re curious about how I got the $\LaTeX$ formatting, see this).
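The shape of that problem can be sketched in a few lines of Python. Everything named here (the sample and model names, the file-naming scheme, the `search_jobs` helper) is hypothetical and only for illustration; it is not the actual pipeline.

```python
from itertools import product


def search_jobs(samples, models):
    """Return one (sample_file, model_file, output_file) triple per pairing.

    The file-naming scheme is an assumption made up for this sketch.
    """
    jobs = []
    for sample, model in product(samples, models):
        jobs.append((
            f"{sample}.fastq",              # input reads for this sample
            f"{model}.hmm",                 # gene model to search with
            f"{sample}.{model}.hits.tsv",   # one output per (sample, model) pair
        ))
    return jobs


# Hypothetical names: 3 samples x 2 models gives 6 targets.
samples = ["sampleA", "sampleB", "sampleC"]
models = ["geneX", "geneY"]

jobs = search_jobs(samples, models)
for sample_file, model_file, output_file in jobs:
    # Printed in the shape of a rule header: target, then prerequisites.
    print(f"{output_file}: {sample_file} {model_file}")
```

Every output depends on two independent names at once, and the number of targets grows as $n \times m$. A make pattern rule gets a single `%` stem to match on, which is a large part of why this is awkward to express.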
After a dozen hours of smashing my head against the table, I was able to get my Makefile to work, but it required some really ugly tricks like secondary expansion and gratuitous calls to sed in my macros (for others with similar problems see here, and here). Plus, debugging make is torture, surely against the Geneva Conventions.
I wanted to use make, I swear I did. It’s open source, well used, extensively tested, available on all relevant systems, etc. And I probably could have… but only by keeping the ugly hack or hard-coding the recipe for each model, and that just didn’t jibe with my recently acquired simple/reproducible mentality. Converts are always the most zealous, after all.
They say graduate school is a time to explore
So what did I do? No, I didn’t immediately start writing a make replacement with all of the features I wanted, like some over-eager graduate student. Jeez! What do you people think of me!? First I checked out the extant alternatives… I hated everything. So then I started writing a make replacement with all of the features I wanted.
The result was one of the first pieces of general-purpose software to come off my laptop that I wouldn’t be entirely ashamed to show to an experienced programmer. It’s rough, don’t get me wrong, but it does everything I need and is actually kinda pretty internally. Well, at least it was before I fixed some glaring problems. Whatever. The point is I want to share it with the world; what better stage exists for its introduction than this blog, which absolutely no one reads?
…Yeah, I’ll probably post it to /r/python too.
Tune in for Part II, in which I explain why you should use my software.