Pitfalls in bioinformatics

Updated: Feb 27



Knowing what not to do is often easier than knowing what to do. By avoiding what is wrong, we are left with what is right. Experience teaches us the distinction in due course, yet it is a grueling process that requires us to make mistakes and learn from them. But it works. Seasoned veterans make fewer mistakes than beginners. The old adage about standing up after every failure is wise, and we see it at work not only in humans.


Deep learning is one of the most powerful families of machine learning algorithms today, and it underlies some of the best tools we have in biology, for example in variant calling and splice prediction. Supervised deep learning relies on letting the model make mistakes, learn from them, and repeat the process until the error is minimized. The principle is profound. It surprised me that SpliceAI, trained on shallow introns like its competitors, predicted deep intronic mutations better than anything else. To me, deep learning shows the power of learning iteratively from mistakes, and that we should not be embarrassed by our half-witted missteps; over time we will become better. By extension, we should be supportive of others, keeping everyone psychologically healthy so that the iteration can continue.


Nonetheless, a more prudent approach is to learn from the missteps of others. Mistakes can be mildly detrimental, or they can be disastrous. Sometimes you make a mistake and the feedback is immediate; at other times it is delayed. The ideal is to have a mentor who has been through thick and thin and can show you the way. However, bioinformaticians are sometimes so rare that we might be the only computing animal in the department. Fortunately, we are part of a bigger community online, and from reading forums I have managed to salvage valuable information. Recently, I came across a post highly relevant to today's topic: "What Are The Most Common Stupid Mistakes In Bioinformatics?" I took the liberty of compiling the experiences shared there, combined them with my own, and listed them here:


Biology

  1. Forgetting to check both strands of DNA

  2. Forgetting to reverse complement DNA

  3. Forgetting to account for the last element in a file

  4. Forgetting that DNA has more than 4 letters (ATCGN...)

  5. Failing to account for nested/intercalated annotation features (e.g. genes)
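
Several of the pitfalls above (both strands, reverse complementing, extra letters such as N) can be guarded against in a few lines. The following is a minimal Python sketch, not a replacement for an established library; the function names `reverse_complement` and `find_both_strands` are my own for illustration, and the complement table assumes the standard IUPAC nucleotide codes.

```python
# IUPAC nucleotide codes and their complements (upper case).
_IUPAC = "ACGTRYSWKMBDHVN"
_COMPLEMENT = "TGCAYRSWMKVHDBN"
# One translation table covering both upper- and lower-case input.
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(), _COMPLEMENT + _COMPLEMENT.lower())


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, preserving case and ambiguity codes."""
    unknown = set(seq) - set(_IUPAC + _IUPAC.lower())
    if unknown:
        # Fail loudly instead of silently passing odd characters through.
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return seq.translate(_TABLE)[::-1]


def find_both_strands(genome: str, motif: str):
    """Return (0-based forward-strand start, strand) for every exact hit."""
    genome, motif = genome.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(pattern, start + 1)  # allow overlapping hits
    return hits
```

Note that a palindromic motif (one equal to its own reverse complement) is reported once per strand at the same position, which may or may not be what an analysis wants.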


Scripting/ Software engineering

  1. Failing to use containers (e.g. Docker) to pin dependencies and keep the programme stable

  2. Failing to account for common user inputs, resulting in bugs

  3. Producing output and returning exit status 0 instead of an error when the output is wrong

  4. Failing to account for upper and lowercase when string matching (ATCG vs atcg)

  5. Failing to account for OS-dependent line breaks (GNU/Linux documents can be incompatible with those from Windows). Use dos2unix to convert plain text files from DOS/Mac to Unix format, and its companion unix2dos for the reverse. I do this whenever someone passes me a file from Windows to be run in Linux.

  6. Using the wrong script (an old version) from an earlier stage of development

  7. Not writing test cases, or metamorphic test cases when there is no oracle

  8. Not testing the script on a small subset of data first

  9. Failing to catch errors
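
To make a few of these concrete, here is a small Python sketch of a script that normalises case and stray line endings, rejects bad input, and reports failure through a non-zero exit status. The `gc_content` and `main` functions are hypothetical examples of mine, not taken from any particular tool.

```python
import sys


def gc_content(seq: str) -> float:
    """Fraction of G/C among called bases; case-insensitive."""
    seq = seq.strip().upper()  # drops a trailing "\r\n" or "\n", normalises case
    if not seq:
        raise ValueError("empty sequence")
    bad = set(seq) - set("ACGTN")
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    called = [base for base in seq if base != "N"]
    if not called:
        raise ValueError("sequence contains only N")
    return sum(base in "GC" for base in called) / len(called)


def main(argv) -> int:
    """Return 0 on success and 1 on any error, so pipelines can notice."""
    try:
        print(f"{gc_content(argv[1]):.3f}")
    except (IndexError, ValueError) as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0

# A real script would end with: sys.exit(main(sys.argv))
```

When there is no oracle for the correct answer, metamorphic relations still give cheap tests: for any valid sequence `s`, `gc_content(s)` should equal `gc_content(s[::-1])`, `gc_content(s.lower())` and `gc_content(s + "\r\n")`.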


Bioinformatic analysis

  1. Using the wrong genome (e.g. confusing human with mouse)

  2. Using the wrong genome version/assembly/annotation/release (GRCh38 vs GRCh37)

  3. Assuming chromosomes sort numerically: under plain text sorting, chr1 is followed by chr10, not chr2

  4. Confusing UCSC "chr1" with Ensembl "1" for chromosome 1

  5. Confusing 0-based and 1-based indexing for start positions

  6. Forgetting that BED files and Python are 0-based, but GTF files and R are 1-based

  7. Confusing half-open ranges [) / (], closed [] and open () end points

  8. Not checking the 0x4 flag in SAM files to see whether a read is mapped before looking at other fields like RNAME, CIGAR, and POS

  9. Not accounting for batch effects, which can hide real effects and give false positives

  10. Using packages without trying to understand what they actually do

  11. Reinventing the wheel when better tools are already available

  12. Trusting that software has taken all of the input when it may have taken only part of it

  13. Assuming the Phred quality offset: it should be +33 for new data, but can be +64 for older data

  • S - Sanger: Phred+33

  • X - Solexa: Solexa+64

  • I - Illumina 1.3+: Phred+64

  • J - Illumina 1.5+: Phred+64

  • L - Illumina 1.8+: Phred+33

  • P - PacBio: Phred+33
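
A few of the coordinate and format pitfalls above are easy to encode as small helpers. The Python sketch below is illustrative only: all four function names are my own, and the Phred guess is a heuristic whose thresholds assume short-read Illumina-style data (characters below ';' only occur with offset 33, characters above 'J' suggest offset 64). It returns None when the data are ambiguous rather than guessing.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes numerically, then the rest alphabetically.

    Accepts both UCSC-style 'chr1' and Ensembl-style '1'.
    """
    core = name[3:] if name.lower().startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


def bed_to_gtf_interval(start: int, end: int):
    """Convert BED (0-based, half-open) to GTF (1-based, closed) coordinates."""
    return start + 1, end


def is_unmapped(flag: int) -> bool:
    """True when the SAM FLAG has bit 0x4 (segment unmapped) set."""
    return bool(flag & 0x4)


def guess_phred_offset(quality_strings):
    """Heuristically guess 33 or 64 from FASTQ quality strings, else None."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return None
    if min(codes) < 59:   # below ';', impossible under a +64 encoding
        return 33
    if max(codes) > 74:   # above 'J', unusual for Illumina 1.8+ (+33)
        return 64
    return None           # ambiguous: inspect more reads or the metadata
```

For example, `sorted(["chr10", "chr2", "chr1", "chrX"], key=chrom_sort_key)` gives the order chr1, chr2, chr10, chrX, whereas a plain `sorted` puts chr10 before chr2.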


Database

  1. Anything once stored in Excel (.xlsx) may have been silently altered (e.g. the gene name SEPT9 --> 9-Sep, a date)

  2. Trusting that a downloaded file is fully downloaded (it is good to check the file properties for expected vs actual local file size)
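
The size check can be scripted so it is not skipped. The sketch below is a hypothetical helper of mine, `download_looks_complete`; comparing an MD5 checksum is an addition beyond the size check described above, and it assumes the data provider publishes one.

```python
import hashlib
import os


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_looks_complete(path, expected_size=None, expected_md5=None) -> bool:
    """Compare local size (bytes) and/or checksum against the expected values."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    if expected_md5 is not None and file_md5(path) != expected_md5.lower():
        return False
    return True
```

A matching size alone does not prove the content is intact, which is why a checksum is worth comparing whenever one is available.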

High-Performance Cluster or Cloud jobs

  1. Wrongly assuming that all jobs have completed


Command-line

  1. Using rm too casually and deleting the wrong files

  2. cat input1.txt > input2.txt overwrites input2.txt, so whatever was in input2.txt disappears. If we want to append input1.txt to input2.txt, use ">>" instead of ">"

  3. Forgetting to use -o output, so that everything is printed to stdout in the terminal after days of running the software

  4. Using grep without -w: grep 'seq12' will also find seq121, seq122 and so on

  5. Assuming grep -w splits words only on whitespace: any character other than a letter, digit or underscore ends a word, so printf "foo-choo" | grep -Fw -e "foo" still returns foo-choo.
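
When the goal is to pull out records by an exact identifier, comparing whole fields sidesteps both grep pitfalls. Here is a minimal Python sketch, assuming the identifier is the first whitespace-separated field on each line; the function name `lines_with_id` is hypothetical.

```python
def lines_with_id(lines, wanted_id: str):
    """Yield lines whose first whitespace-separated field equals wanted_id."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == wanted_id:
            yield line
```

With the lines "seq12\tACGT", "seq121\tGGCC" and "seq12-alt\tTTAA", asking for "seq12" keeps only the first line, whereas grep -w 'seq12' would also return the seq12-alt line.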


Please let me know if there is anything important I missed. It would be extremely helpful to me and to others who might come across this. Thank you.
