Wei Yuan
- Feb 7, 2022

Pitfalls in bioinformatics

Updated: Feb 27, 2022

Knowing what not to do is easier than the converse. By not doing what is wrong, one can only do right. Experience teaches us the distinction between the two in due course, yet it is a grueling process that requires us to make mistakes and learn from them. But it works. Seasoned veterans make fewer mistakes than beginners. The old adage of standing up after every failure is wise, and we do not see it only in humans.

One of the most powerful families of machine learning algorithms today is deep learning, which serves as the basis for the best tools we have in biology, such as those for variant calling and splice prediction. Supervised deep learning relies on allowing the algorithm to make mistakes, learn from them, and iterate until the error is minimized. This principle is profound. It surprised me that SpliceAI, trained on shallow introns like its competitors, predicted deep intronic mutations better than anything else. To me, deep learning shows the power of iterative learning from mistakes, and that we should not be embarrassed by our half-witted missteps; over time we will become better. By extension, this means that we should be supportive of others, keeping everyone psychologically healthy to enable the iteration.

Nonetheless, a more prudent approach is to learn from the missteps of others. A mistake can be mildly detrimental, or it can be disastrous. Sometimes you make a mistake and the feedback is immediate. At other times, the feedback is delayed. The ideal is to have a mentor who has been through thick and thin and can show you the way. However, bioinformaticians are sometimes so rare that we might be the only computing animal in the department. Fortunately, we are part of a bigger community online, and from reading forums I have managed to salvage valuable information. Recently, I came across a post highly relevant to our topic today: "What Are The Most Common Stupid Mistakes In Bioinformatics?" I took the liberty of compiling the experiences from the contributions there, combined them with mine, and listed them here:

Biology

Forgetting to check both strands of DNA
Forgetting to reverse complement DNA
Forgetting to account for the last element in a file
Forgetting that DNA sequences can contain more than 4 letters (A, T, C, G, plus N and other ambiguity codes)
Failing to account for nested/intercalated annotation features (e.g. genes)
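
Several of the items above (both strands, reverse complementing, ambiguity codes) can be sketched in a few lines of Python. This is only an illustration, not a replacement for an established library: the names reverse_complement and find_on_both_strands are hypothetical, and the complement table assumes the standard IUPAC nucleotide codes.

```python
# Complement table covering the IUPAC nucleotide codes, in both cases.
_IUPAC = "ACGTURYSWKMBDHVN"
_COMPL = "TGCAAYRSWMKVHDBN"
COMPLEMENT = str.maketrans(_IUPAC + _IUPAC.lower(), _COMPL + _COMPL.lower())


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, preserving case and ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_both_strands(genome: str, motif: str) -> list:
    """Return sorted (0-based start, strand) pairs for every exact motif hit.

    Matching is case-insensitive; ambiguity codes are compared literally,
    so an N in the motif only matches an N in the genome.
    """
    genome_u = genome.upper()
    motif_u = motif.upper()
    hits = []
    for strand, query in (("+", motif_u), ("-", reverse_complement(motif_u))):
        start = genome_u.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome_u.find(query, start + 1)  # allow overlapping hits
    return sorted(hits)
```

For a minus-strand hit, the reported start is the leftmost position of the match on the plus strand, which is how most annotation formats expect it.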

Scripting/ Software engineering

Failing to use containers (e.g. Docker) to pin dependencies and keep a program stable
Failing to account for common user inputs, resulting in bugs
Producing output and returning exit code 0 instead of an error when the output is wrong
Failing to account for upper and lower case when string matching (ATCG vs atcg)
Failing to account for OS-dependent line breaks (files written on Windows can be incompatible with GNU/Linux tools). dos2unix converts plain text files from DOS/Mac to Unix format and vice versa. I run it whenever someone passes me a file from Windows to be run on Linux.
Using the wrong (old) version of a script from an earlier stage of development
Not writing test cases, or metamorphic test cases when there is no oracle
Not testing the script on a small subset of data first
Failing to catch errors
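
To make the metamorphic testing item concrete, here is a minimal sketch. When there is no oracle telling us the right answer for real data, we can still check relations that must hold between outputs of related inputs. The function gc_content is a hypothetical stand-in for whatever is being tested, and the four relations are my own choice for this example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among the unambiguous bases; case-insensitive."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "AT")
    if total == 0:
        # Fail loudly instead of returning a misleading number.
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


def check_metamorphic_relations(seq: str) -> None:
    """Raise AssertionError if a relation that must hold is violated."""
    base = gc_content(seq)
    assert gc_content(seq[::-1]) == base      # reversing changes nothing
    assert gc_content(seq.lower()) == base    # letter case changes nothing
    assert gc_content(seq + seq) == base      # doubling keeps the ratio
    assert gc_content(seq + "NNNN") == base   # ambiguous bases are ignored
```

Running check_metamorphic_relations on a small subset of real sequences covers two items of the list at once: it is a test without an oracle, and it is cheap enough to run before the full data set.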

Bioinformatic analysis

Using the wrong genome (confusing human and mouse)
Using the wrong genome version/assembly/annotation/release (GRCh38 vs GRCh37)
Assuming chromosome order: after lexicographic sorting, chr1 is followed by chr10, not chr2
Confusing UCSC "chr1" with Ensembl "1" for chromosome 1
Confusing 0-based and 1-based indexing for start positions
BED files and Python are 0-based, but GTF files and R are 1-based
Confusing half-open [) / (], closed [] and open () intervals
Not checking the 0x4 flag in SAM files to see whether a read is mapped before looking at other fields such as RNAME, CIGAR and POS
Not accounting for batch effects, which can hide real effects and give false positives
Using packages without trying to understand what they actually do
Reinventing the wheel when better tools are already available
Trusting that software will take all of the input rather than only part of it
Assuming the Phred offset: it should be +33 for new data, but can be +64 for older data

S - Sanger Phred+33
X - Solexa Solexa+64
I - Illumina 1.3+ Phred+64
J - Illumina 1.5+ Phred+64
L - Illumina 1.8+ Phred+33
P - PacBio Phred+33
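
A few of the checks in this section can be sketched as small Python helpers. All function names here are hypothetical, and guess_phred_offset is only a heuristic under my own assumptions: it treats any character below ';' as proof of Phred+33, treats any character above 'J' as a sign of a +64 encoding (which can be wrong for long-read data with very high qualities), and otherwise declines to answer.

```python
import re


def bed_to_gtf(start: int, end: int) -> tuple:
    """Convert 0-based half-open [start, end) to 1-based closed [start, end]."""
    return start + 1, end


def gtf_to_bed(start: int, end: int) -> tuple:
    """Convert 1-based closed [start, end] to 0-based half-open [start, end)."""
    return start - 1, end


def chrom_sort_key(name: str) -> tuple:
    """Sort key: numbered chromosomes numerically, then the rest by name."""
    core = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    if re.fullmatch(r"[0-9]+", core):
        return (0, int(core), "")
    return (1, 0, core)


def is_unmapped(flag: int) -> bool:
    """True when the SAM FLAG has the 0x4 (segment unmapped) bit set."""
    return bool(flag & 0x4)


def guess_phred_offset(quality: str):
    """Heuristically guess 33 or 64 from one quality string, else None."""
    if not quality:
        return None
    codes = [ord(char) for char in quality]
    if min(codes) < ord(";"):   # impossible under either +64 scheme
        return 33
    if max(codes) > ord("J"):   # assumption: beyond usual Illumina 1.8+ range
        return 64
    return None                 # ambiguous: inspect more reads
```

For example, sorted(["chr10", "chr2", "chr1"], key=chrom_sort_key) puts chr2 before chr10, and a BED interval (0, 100) becomes the GTF interval (1, 100) covering the same 100 bases.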

Database

Anything once stored in Excel (.xlsx) may have been silently altered (e.g. the gene name SEPT9 --> 9-Sep)
Trusting that a downloaded file is fully downloaded (it is good to compare the expected file size with the actual local file size)
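
The download check can be scripted. The sketch below compares the local size with the expected size, as suggested above. Comparing a checksum is my own addition and assumes the data provider publishes one; both function names are hypothetical.

```python
import hashlib
import os


def file_size_matches(path: str, expected_bytes: int) -> bool:
    """True when the local file has exactly the expected number of bytes."""
    return os.path.getsize(path) == expected_bytes


def file_digest_hex(path: str, algorithm: str = "md5") -> str:
    """Hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A truncated download fails the size check immediately, while a checksum mismatch also catches files that have the right size but corrupted content.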

High-Performance Cluster or Cloud jobs

Wrongly assuming that all jobs have completed
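
One scheduler-independent way to guard against this is to list the output files every job should have produced and check them before moving on. This is a minimal sketch with a hypothetical function name; it assumes each job writes a known output path, and it cannot tell a finished file from one that is still being written.

```python
import os


def incomplete_outputs(expected_paths) -> list:
    """Return the expected output files that are missing or empty."""
    return [
        path
        for path in expected_paths
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]
```

An empty list is necessary but not sufficient evidence that everything finished, so it is still worth checking the exit status and logs that the scheduler reports.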

Command-line

Using rm too casually and deleting the wrong files
cat input1.txt > input2.txt overwrites whatever was in input2.txt. If we want to append input1.txt to input2.txt, use ">>" instead of ">"
Forgetting to use -o output, so that everything is printed to stdout in the terminal after days of running the software
grep without -w: grep 'seq12' will also find seq121, seq122 and so on
grep -w does not split words on whitespace only: any character other than a letter, digit or underscore counts as a word boundary, so printf "foo-choo" | grep -Fw -e "foo" still returns foo-choo.
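
When identifiers may contain hyphens or dots, one way around both grep pitfalls is to compare whole whitespace-delimited tokens. The sketch below does this in Python; the function name is hypothetical, and it assumes identifiers never contain whitespace.

```python
def lines_with_exact_id(lines, wanted: str):
    """Yield lines in which `wanted` equals a whole whitespace-delimited token."""
    for line in lines:
        if wanted in line.split():
            yield line
```

With this, searching for seq12 matches neither seq121 nor seq12-b, and searching for foo does not match foo-choo.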

Please let me know if there is anything important I missed. It would be extremely helpful to me and to others who might come across this. Thank you.
