Updated: Feb 20
I used to struggle a lot to get command-line software from the dark alleys of science to work. And it didn't. For days. When something like that happens, it's difficult not to blame my own incompetence. Then I realized that, as someone who writes a lot of in-house software, I don't blame my users if my programme doesn't run. A developer should make their software as easy to use as possible, to the point that running the programme is intuitive: "Set up by running this line of code, then give me this set of inputs... tell me where the output should go. Done!" Unfortunately, this isn't always the case in bioinformatics.
Looking at the field from a bird's-eye view, it's not difficult to see why many programmes don't work. To start, there are cultural and systemic reasons. One, publishing a software tool is often the only goal, and making it stable and well maintained is a distant second priority. Two, unless software is built to need no maintenance, it requires regular work to keep running. The academic system, however, rewards the publication of new tools but not their maintenance. Three, most bioinformaticians are not trained software or user-experience engineers, so the tools they develop tend to be far from intuitive IT products like your phone and its apps. These three problems are complex and honestly out of my jurisdiction, so I'll leave them as food for thought without much discussion of solutions here.
There are, however, also technical reasons, which we can tackle today thanks to evolving technology. The problem with many bioinformatics tools is that they were built in a very specific computing ecosystem: the operating system, the version of the programming language and the other software used to build them (in other words, the "dependencies").
Any mismatch between your computing environment and theirs can break their app on your computer. This is not a trivial problem to solve. The dependencies you install for their app (assuming they have listed everything completely) have their own dependencies.
Dependencies are important. Take the difference between Python 2 and Python 3. In fortunate situations, it causes the software to break without giving you any output. The terrifying alternative is that the software gives you an output despite the version incompatibility, with an undetected bug handing you erroneous results. For example, 7 / 2 evaluates to 3 in Python 2 but 3.5 in Python 3. Yes, this is terrible. And it really bothers me at night.
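One cheap defence against the silent-failure case is to make a script stop loudly when the interpreter version is not the one it was written for. Here is a minimal sketch of that idea in plain shell. The helper names version_major and require_major are my own inventions for illustration, not part of any tool mentioned in this post.

```shell
# Hypothetical helper: extract the major version from a string
# such as "Python 3.10.4".
version_major() {
    # Drop everything up to the last space, then everything from the first dot.
    v=${1##* }
    printf '%s\n' "${v%%.*}"
}

# Hypothetical guard: fail unless the reported version has the expected major number.
require_major() {
    expected=$1
    reported=$2
    actual=$(version_major "$reported")
    if [ "$actual" != "$expected" ]; then
        echo "error: need major version $expected, found '$reported'" >&2
        return 1
    fi
}

# Typical use at the top of a pipeline script
# (2>&1 because Python 2 prints its version to stderr):
#   require_major 3 "$(python --version 2>&1)" || exit 1
```

A guard like this only catches the mismatches you thought to check for, which is exactly why pinning the whole environment is more attractive.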
With the advent of Docker, the dependency nightmare can end. Briefly, Docker allows you to run containers. In a container, developers can define a very specific environment in which their app runs. When you run an app with Docker for the first time, it downloads exactly what that app needs and runs it in the developer-defined environment. In short, when you use Docker, you can bid farewell to the hours spent getting things to run and installing a myriad of sometimes faulty dependencies.
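The "download on first use" behaviour can be made explicit with two standard Docker subcommands, docker image inspect and docker pull. The sketch below is only an illustration: ensure_image is a hypothetical helper name, and in everyday use docker run does this check for you.

```shell
# Hypothetical helper: download an image only if it is not already stored locally.
#   $1 = image reference, e.g. repository:tag
ensure_image() {
    # "docker image inspect" exits non-zero when the image is not present locally.
    if docker image inspect "$1" >/dev/null 2>&1; then
        echo "already present: $1"
    else
        docker pull "$1"
    fi
}

# Example call (replace the placeholder with a real image reference):
#   ensure_image "$image"
```

The first call downloads the image; later calls find it locally and return straight away.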
Additionally, Docker makes your analysis reproducible. To run a piece of software, you'll need to download a Docker image: a packaged snapshot of the software and its environment, from which your container is created. Each image carries a version label (a tag), which means that as long as you specify this tag, and the publisher has not overwritten it, you get the same container as before. In this way, your software or pipeline stays constant. This reproducibility does not end on your own computer; someone who pulls the same image can reproduce the same results on their computer without fuss. Docker, it seems, supports the central tenet of reproducible science very well.
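Pinning only helps if you actually write the tag down. When no tag is given, Docker falls back to the tag "latest", which can point to a different image next month. The sketch below refuses such moving tags before anything is pulled. pinned_image and the GATK_TAG variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: build "repository:tag" and refuse tags that can move.
pinned_image() {
    repo=$1
    tag=$2
    case "$tag" in
        ""|latest)
            echo "error: pin an explicit version tag for $repo" >&2
            return 1
            ;;
    esac
    printf '%s:%s\n' "$repo" "$tag"
}

# GATK_TAG is an assumed variable holding the exact tag used for the analysis.
#   image=$(pinned_image broadinstitute/gatk "$GATK_TAG") && docker pull "$image"
```

Recording the resulting image reference next to your results lets a collaborator pull the same thing later.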
Enough talk, let's walk the walk. Here's a quick demonstration of Docker using the Genome Analysis Toolkit (GATK). Briefly, GATK provides industry-standard genomic data analysis, and its Docker container has a lot of tools to do just that.
After installing Docker, I run this command in my Linux terminal:
docker run -v /mnt/c/aligned_reads:/my_data -it broadinstitute/gatk:184.108.40.206
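And poof, I'm inside the container now. The -v option mounts my /mnt/c/aligned_reads folder as /my_data inside the container, and -it gives me an interactive terminal.
To use the tools, I just have to run their respective commands. For instance, let me try getting the help information for GATK's variant caller.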
gatk HaplotypeCaller --help
Using GATK jar /gatk/gatk-package-220.127.116.11-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-18.104.22.168-local.jar HaplotypeCaller --help
USAGE: HaplotypeCaller [arguments]

Call germline SNPs and indels via local re-assembly of haplotypes
Version:22.214.171.124

Required Arguments:
...
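You don't have to step inside the container every time. Assuming the image accepts a command after its name (as the interactive session above suggests), a single tool can be run in a throwaway container. This is a sketch rather than an official recipe: gatk_in_docker is a hypothetical wrapper, and the file names in the second example are placeholders. The -R, -I and -O options are HaplotypeCaller's reference, input and output arguments.

```shell
# Hypothetical wrapper: run one GATK command in a throwaway container.
#   $1 = host folder to mount, $2 = image reference, rest = arguments passed to gatk
gatk_in_docker() {
    host_dir=$1
    image=$2
    shift 2
    # --rm deletes the container when the command finishes;
    # -v makes host_dir visible inside the container as /my_data.
    docker run --rm -v "$host_dir":/my_data "$image" gatk "$@"
}

# Example calls (keep "$image" set to the image and tag you pinned):
#   gatk_in_docker /mnt/c/aligned_reads "$image" HaplotypeCaller --help
#   gatk_in_docker /mnt/c/aligned_reads "$image" HaplotypeCaller \
#       -R /my_data/reference.fasta -I /my_data/sample.bam -O /my_data/sample.vcf.gz
```

Because the output is written under /my_data, it lands in the mounted host folder and survives after the container is removed.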
That's all. The simplicity of Docker translates directly into your efficiency. Increasingly, I see software distributed with Docker, and I believe it will be a mainstay, so it'd be good to hop on now. Give Docker a try and let me know what you think. If you want a quick guide on how to set up Docker for Windows, I provide a tutorial here.