Learning to be productive in bioinformatics
Updated: Apr 28, 2022
It has been five months since I officially started working as a professional bioinformatician. Here are some of the things I've learned that may be helpful to fellow entry-level bioinformaticians:
If any analysis takes too long to run, take the time to look for alternatives instead of waiting for the program to finish.
If pandas is too slow on extremely large datasets, use Vaex instead (or try Modin, if it works for you).
Some Biopython code is slow because it does a lot of overhead work before getting to the actual task. For instance, when parsing large FASTA or FASTQ files, use pyfastx instead of Biopython's SeqIO.parse.
If the best tool is too slow, update to the latest version and try again. I made the mistake of running an older version of Samtools for depth calculation on BAM files because it had a convenient Docker image. A single analysis took a day. When I ran the same analysis with an updated version of Samtools, it took only 20 minutes for 15 files! Nowadays I find myself using Bioconda more than Docker because the former has up-to-date tools that I can install in a conda environment.
If optimized tools are still too slow for my purposes, I google for alternatives. I was using local BLAST because I thought it was the best tool for sequence alignment. In the end, I found a lightning-fast alternative: VSEARCH. Best of all, the BLAST-like global alignment tool in VSEARCH reports alignment features, which is handy if I ever need to correlate the alignment with other variables for machine learning.
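Here is a minimal sketch of how such alignment features could be pulled into Python. It assumes VSEARCH was run with --usearch_global and --blast6out, which write a 12-column tab-separated table. The file names, the 0.9 identity threshold, the column labels and the example line are all placeholders made up for illustration.

```python
import csv
import io

# Labels for the 12 tab-separated fields of a blast6out-style table.
# The labels are chosen for this sketch; the file itself has no header.
BLAST6_COLUMNS = [
    "query", "target", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "t_start", "t_end", "evalue", "bits",
]


def build_vsearch_command(query_fasta, db_fasta, out_tsv, min_identity=0.9):
    """Return an argument list for a VSEARCH global-alignment search.

    File names and the identity threshold are placeholders.
    """
    return [
        "vsearch",
        "--usearch_global", query_fasta,
        "--db", db_fasta,
        "--id", str(min_identity),
        "--blast6out", out_tsv,
    ]


def parse_blast6(handle):
    """Parse a blast6out-style table into a list of dicts with numeric fields."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(BLAST6_COLUMNS, fields))
        for key in BLAST6_COLUMNS[2:]:
            row[key] = float(row[key])  # everything after the two labels is numeric
        rows.append(row)
    return rows


# A made-up result line, only to show the shape of the parsed output.
example = "read1\tref7\t98.5\t200\t3\t0\t1\t200\t11\t210\t-1\t0\n"
hits = parse_blast6(io.StringIO(example))
print(hits[0]["target"], hits[0]["pct_identity"])
```

The command list can be handed to subprocess.run(..., check=True) on a machine where VSEARCH is installed, and each parsed row is then a ready-made set of numeric features per query.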
Speed up my own code
If I write code, I should find ways to optimize it algorithmically. An algorithmic speed-up can be a thousand-fold, far more than most multiprocessing can deliver, so it pays to be serious about data structures and algorithms.
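As a small illustration of an algorithmic speed-up, the sketch below counts how many query IDs appear in a reference collection, first with a list and then with a set. The read IDs are made up, and the exact timings will differ from machine to machine; the point is only that swapping the data structure changes each lookup from a linear scan to a hash lookup.

```python
import time


def count_shared_with_list(queries, reference_ids):
    """Count queries found in reference_ids using a list (linear scan per lookup)."""
    reference = list(reference_ids)
    return sum(1 for q in queries if q in reference)


def count_shared_with_set(queries, reference_ids):
    """Same count, but with a set (hash lookup, roughly constant time)."""
    reference = set(reference_ids)
    return sum(1 for q in queries if q in reference)


# Made-up read IDs for illustration.
reference_ids = [f"read_{i}" for i in range(5_000)]
queries = [f"read_{i}" for i in range(0, 10_000, 2)]  # half of these exist

start = time.perf_counter()
slow = count_shared_with_list(queries, reference_ids)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = count_shared_with_set(queries, reference_ids)
set_seconds = time.perf_counter() - start

assert slow == fast  # both approaches must agree
print(f"list: {list_seconds:.4f}s, set: {set_seconds:.4f}s, shared: {fast}")
```

The gap widens as the reference grows, because the list version does work proportional to the reference size for every single query.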
tqdm is useful and easy to implement to gauge the time required for code to finish. You can create a progress bar to estimate run times:
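    from tqdm import tqdm

    long_list = [1, 2, 3, 4, 5] * 20000000  # 100,000,000 items
    pbar = tqdm(total=len(long_list))  # total sets the length of the bar
    for item in long_list:
        item += 1  # stand-in for the real work
        pbar.update()  # advance the bar by one item
    pbar.close()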
13%|█▎ | 12728833/100000000 [00:06<00:44, 1971188.17it/s]
(6 seconds passed, 44 more seconds to go)
If I use data frames, I try to vectorize operations instead of looping over rows.
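To show what vectorization means in practice, here is a minimal sketch with pandas and NumPy. The columns gc_count and length are invented for the example; the same pattern applies to any column arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up data: GC base counts for 100,000 reads of length 100.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "gc_count": rng.integers(0, 100, size=100_000),
    "length": np.full(100_000, 100),
})


def gc_fraction_loop(frame):
    """Row-by-row version: calls a Python function once per row (slow)."""
    return frame.apply(lambda row: row["gc_count"] / row["length"], axis=1)


def gc_fraction_vectorized(frame):
    """Vectorized version: one array operation over whole columns (fast)."""
    return frame["gc_count"] / frame["length"]


# Check on a small slice that both versions give the same answer.
small = df.head(1_000)
assert np.allclose(gc_fraction_loop(small), gc_fraction_vectorized(small))

df["gc_fraction"] = gc_fraction_vectorized(df)
```

Timing the two functions on the full data frame (for example with time.perf_counter) should make the difference obvious.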
Once my code is optimized to the best of my ability, I resort to multiprocessing (I covered it here).
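For readers who have not used it, the following is a minimal sketch of multiprocessing with Python's standard library, not necessarily the approach from my other post. The gc_content function, the toy sequences, the two worker processes and the chunk size are all assumptions chosen for illustration.

```python
from multiprocessing import Pool


def gc_content(sequence):
    """Fraction of G and C bases in one sequence (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_content_parallel(sequences, processes=2):
    """Compute gc_content for each sequence across worker processes."""
    with Pool(processes=processes) as pool:
        # chunksize batches the work so each worker gets many items at once
        return pool.map(gc_content, sequences, chunksize=1000)


if __name__ == "__main__":
    # Toy input; real sequences would come from a FASTA or FASTQ file.
    sequences = ["ATGC", "GGCC", "ATAT", ""] * 1000
    results = gc_content_parallel(sequences)
    print(results[:4])
```

Starting processes has its own cost, so this only pays off when each task is heavy enough or there are many of them.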
Find ways to do things efficiently before starting
For instance, I wanted to do machine learning, so I googled for the fastest way to do so effectively and efficiently. It led me to PyCaret, a low-code library for machine learning. PyCaret trained a list of machine learning algorithms on my dataset and allowed me to pick the best ones among them, tune them to make them better, and build an ensemble of the best models. During the process, PyCaret abstracts away many steps like the train-test split, transforming the data, selecting the best features, removing multicollinearity while keeping only the best of the collinear variables, and more. I find the default settings reasonable, but if you're keen to dive into the details, the API is flexible and well-documented. I believe PyCaret saved me countless hours and let me focus on science rather than ML technicalities, long scripts, and a ton of trial and error. This would not have been possible if I hadn't stopped and looked for faster ways to do things.
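The workflow described above can be sketched roughly as follows. This assumes a classification problem and PyCaret's functional API (setup, compare_models, tune_model, blend_models). The option values, the function name train_ensemble and the choice of three models are illustrative, and the available setup options may differ between PyCaret versions.

```python
# Settings passed to PyCaret's setup(); the values are illustrative choices.
SETUP_OPTIONS = {
    "train_size": 0.7,                   # train-test split
    "normalize": True,                   # scale numeric features
    "feature_selection": True,           # keep the most useful features
    "remove_multicollinearity": True,    # drop highly correlated features
    "multicollinearity_threshold": 0.9,
    "session_id": 42,                    # fixed seed for reproducibility
}


def train_ensemble(data, target, n_models=3):
    """Compare models, tune the top few, and blend them into one ensemble.

    `data` is a pandas DataFrame and `target` is the label column name.
    Keep n_models above 1 so that compare_models returns a list.
    """
    # Imported here so the module loads even where PyCaret is not installed.
    from pycaret.classification import (
        setup, compare_models, tune_model, blend_models,
    )

    setup(data=data, target=target, **SETUP_OPTIONS)
    top_models = compare_models(n_select=n_models)  # best models by default metric
    tuned = [tune_model(model) for model in top_models]
    return blend_models(tuned)
```

For a regression problem, the same function names are available from pycaret.regression.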
Before starting a project, take a lot of time to think about it
I recall my naive undergraduate belief that productivity was measured by the time spent working. I rarely sat down to think and jumped into experiments quickly. I slotted work into every waking minute. In the end, the inefficiency had me working long hours. Worse, the ineffectiveness produced work of such low quality that it drew only harsh criticism from my supervisor. I had the drive but not the intellectual capacity. Such a fate is common for graduate students, but I am glad to have experienced it in my long undergraduate research life.
I learned to be patient and resist the pressure to start work immediately. Wait. Read first. It is far better to spend a few weeks doing nothing but reading and thinking than to spend months doing something completely useless. To be productive, I need a good direction, and if it takes two weeks to find it, so be it.
Read. Adjust search parameters to fish for any papers missed.
But I found that reading alone does not help me think. I think only when I write after reading a ton. I write a little, notice the gaps in my thinking, and read to fill them. "Oh, this is a good direction, but I should check what's done in the industry. Ah, there's a white paper on the algorithm. Seems like this is already done. But what have they not done? From my other reading, I know the clinics mandate certain requirements not met in this algorithm. Can I incorporate clinical considerations into this algorithm? How much value will this add? Hmmm, it's a lot of manual work but does not warrant automation since it is one-time work. Better move on to do something more worthy of my limited time." If I could have thought like that a few years back, I would have saved so much time I spent on nonsense!
If writing is not working well, I remind myself that I'm a social animal, and that simulating a conversation might help develop independent thinking. This is what Steve Jobs had to say about Wozniak:
"What I saw with Woz was somebody who was fifty times better than the average engineer. He could have meetings in his head. The Mac team was an attempt to build a whole team like that, A players."
Isaacson, Walter. Steve Jobs (p. 363). Simon & Schuster. Kindle Edition.
The cliche of working smart has a profound significance that, in my case, took experience to understand. Think hard (by writing) about the direction, work fast to fail at bad ideas fast, pick myself up and keep going. This is my best strategy so far, but I am sure to grow and come back here with better things to share.