Bionitio | Software engineering template

Updated: Feb 1

The Pareto distribution, the statistical property that gave rise to the 80/20 rule, tells us that a small number of things (20%) can make a disproportionately large impact (80%). It applies to many things, one of which to me is reading. Every once in a while, I come across a book or journal article that actually makes a difference. I have been meaning to share a particular one, in part because it is practically useful, in part because it seems to have escaped the notice of so many despite its applicability in the myriad of GitHub pages I've seen.

Here I present Bionitio. Briefly, Bionitio provides templates for developing command-line software and thus tackles perhaps the most excruciating problem in bioinformatics: poor software quality. The causes are clearly outlined in the article:

"The underlying causes of poor bioinformatics software quality are multifaceted; however, 2 important factors have been highlighted in the literature: (1) the lack of software engineering training amongst bioinformaticians [2,3,6–11] and (2) the fact that research groups have limited time and money to spend on software quality assurance [10,12–15]."

Consequently, we suffer from the implications:

"As a result many bad practices are recurrently observed in the field. Lack of code documentation and user support makes tools hard to install, understand, and use. Limited or non-existent testing can result in unreliable and buggy behaviour. A high degree of coupling with thelocal computing environment and software dependencies impedes portability. The consequences of using poor-quality software can have a significant impact on scientific outcomes. Substantial amounts of users’ time can be wasted in trying to get programs to work, scientific methods can be difficult to reproduce, and in the worst-case scenario, scientific results can be invalid owing to program errors or incorrect use [3,7,8,10,12,13,16,17]."

To increase the quality of software writing, the authors of Bionitio proposed the following:

"In this work we adopt a pragmatic approach to improving bioinformatics software quality that is summarized by Rule 7 in Carey and Papin’s “Ten simple rules for biologists learning to program”: “develop good habits early on” [19]. The idea is that new bioinformatics tools should be started by copying and modifying a well-written existing example. This allows bioinformaticians to get started quickly on solving the crux of their problem but also ensures that all the ingredients of good programming style and functionality are present from the beginning."

The template for 12 programming languages is provided on Bionitio's GitHub page. When I discovered Bionitio, I was immediately impressed by the professionally organized and well-documented README. However, it took me some time before I implemented the code practices within the programming scripts because it was admittedly too big a leap for me to reach the recommended standards. If you noticed, my initial reluctance and inability is precisely the point made by the authors: "the lack of software engineering training amongst bioinformaticians".


Later, like any good programmer, I strategized to learn Bionitio by breaking it into many parts and implementing each one of them into my codes one by one. I have gotten pretty far, but there's still more to learn. I think Bionitio serves a monumental purpose in bioinformatics, but its lack of popularity (4 citations as of 23 JAN 2021) might be due to its delivery of a titan of information too large for most of us to chew off. It is hardly the Bionitio team's fault: good code practices don't come step by step, they come all at once. I think they did a great job at defining an ideal level bioinformatics software engineers should reach for, but it's up to us to dare to fall short of the ideal while learning, to take things one step at a time and learn over time. In due course, we become better.


In the remaining parts of this article, I will show a Python script to illustrate a simplified version of Bionitio's recommendations. Keep in mind that this is a gateway drug, purposely oversimplified and missing many things, meant to help people understand and get into Bionitio.


Before we start, let's keep the end goal in mind. We want a command-line software that reverse complements every DNA in a fasta file, and output a new fasta file of reverse comp'ed sequences. The command should look like this:

python3 rcomper.py --out ./reverse_complemented_seqs.fasta ./input_seqs.fasta
  • 'python3 rcomper.py' is to tell the computer to run rcomper.py using python3.

  • '--out ./reverse_complemented_seqs.fasta' tells the software to send my output fasta file into the current directory (./) as file name 'reverse_complemented_seqs.fasta'.

  • './input_seqs.fasta' is simply the input fasta file in my current directory.


Let's start writing the Python code.


[1] First, we pen some information about what the software does and where to contact the developer. We also choose a license: if you don't know which license is suitable for you, I recommend using choose a license to help you.

'''
Module      : Main
Description : The main entry point for the program.
Copyright   : (c) WEIYUAN, 23JAN2021 
License     : MIT_LICENSE 
Maintainer  : BIO.WEIYUAN@GMAIL.COM 
Portability : POSIX

The program accepts a list of DNA sequences in FASTA format and
returns a list of their reverse complement in FASTA format
'''

[2] Next, we import the modules and initialize variables that we will use in the programme.

from argparse import ArgumentParser
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

PROGRAM_NAME = "rcomper"
PROGRAM_VERSION = 1.00

[3] Now we start to write the functions we will use in the software. An important note is that we won't call the functions yet, we will do so at the end of the script.


The first function we write allows our python script to accept command-line arguments. This function 1. initiate the ArgumentParser object from argparse, and 2. add arguments to the object.

def parse_args(): 
    '''Parse command line arguments.
    '''
    description = 'Read a list of fasta sequences and returns their reverse complement'
    parser = ArgumentParser(description=description)
    
    parser.add_argument('--out',
                        type=str,
                        default = './out.fasta',
                        help='Output file location and file name')        
    parser.add_argument('fasta_file',
                        type=str,
                        help='Input FASTA files')
    return parser.parse_args()

The second function will do all the computation of reading fasta files and reverse complementing. Following this, the third function outputs the new fasta file.

def reverse_complement_all(fasta):

    '''
    Reads fasta file and convert its entries to its reverse complement.
    Return the reverse complement sequence as a SeqRecord.
    '''
    
    # Reads fasta file
    fasta_records = SeqIO.parse(fasta, "fasta")
    
    # Create empty list to store reverse complemented seqs
    reverse_complement_list = [] 
    
    # Loop through fasta_records and 
    # add reverse complement seq to empty list
    for record in fasta_records:
        forward_seq = Seq(str(record.seq))
        reverse_complement_list += [ SeqRecord(forward_seq.reverse_complement(),
                                    id = record.id,
                                    description = '') ]
    return reverse_complement_list

def write_out(reverse_comp_records, file_location):
    '''Write reverse complement fasta file to user-specified location'''
    SeqIO.write(reverse_comp_records, file_location, "fasta")

The final function is main() that will handle the sequence of functions to call

  1. Get inputs from command-line. First runs parse_args() to parse input from command-line, then assigns these inputs to 'fasta' and 'file_location' variables in our python script.

  2. Run reverse_complement_all(fasta) to parse fasta file and reverse complement its sequences.

  3. Run write_out() to output a new fasta file containing the reverse comp'ed sequences.

def main():
    "Orchestrate the execution of the program"
    
    # Get command-line input
    args = parse_args()
    fasta = args.fasta_file
    file_location = args.out
    
    # Run function
    reverse_comp_records = reverse_complement_all(fasta)
    write_out(reverse_comp_records, file_location)

[4] At the bottom of my script, I let the computer know that if my script is run from command-line, then run main() function.

# If this script is run from the command line then call the main function.
if __name__ == '__main__':
    main()

[5] I can now run my software on the command-line. Easy peasy.

python3 rcomper.py --out ./reverse_complemented_seqs.fasta ./input_seqs.fasta

The full python script is provided below. I think this is a good starting point to learn Bionitio, but note that there are many more practices I have not implemented to keep things simpler. Please visit the bionitio-python page if you wish to see a more complete script and the other components of complete software. Let me know in the comments if you have any questions.

'''
Module      : Main
Description : The main entry point for the program.
Copyright   : (c) WEIYUAN, 23JAN2021 
License     : MIT_LICENSE 
Maintainer  : BIO.WEIYUAN@GMAIL.COM 
Portability : POSIX

The program accepts a list of DNA sequences in FASTA format and
returns a list of their reverse complement in FASTA format
'''

from argparse import ArgumentParser
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

PROGRAM_NAME = "rcomper"
PROGRAM_VERSION = 1.00

def parse_args():
    '''Parse command line arguments.
    Returns Options object with command line argument values as attributes.
    Will exit the program on a command line error.
    '''
    description = 'Read a list of fasta sequences and returns their reverse complement'
    parser = ArgumentParser(description=description)
    parser.add_argument('--out',
                        type=str,
                        default = './out.fasta',
                        help='Output file location and file name')
    parser.add_argument('fasta_file',
                        type=str,
                        help='Input FASTA files')
    return parser.parse_args()

def reverse_complement_all(fasta):
    # Reads fasta file
    fasta_records = SeqIO.parse(fasta, "fasta")
    # Create empty list to store reverse complemented seqs
    reverse_complement_list = [] 
    # Loop through fasta_records and 
    # add reverse complement seq to empty list
    for record in fasta_records:
        forward_seq = Seq(str(record.seq))
        reverse_complement_list += [ SeqRecord(forward_seq.reverse_complement(),
                                    id = record.id,
                                    description = '') ]
    return reverse_complement_list

def write_out(reverse_comp_records, file_location):
    '''Write reverse complement fasta file to user-specified location'''
    SeqIO.write(reverse_comp_records, file_location, "fasta")

def main():
    "Orchestrate the execution of the program"
    
    # Get command-line input
    args = parse_args()
    fasta = args.fasta_file
    file_location = args.out
    
    # Run function
    reverse_comp_records = reverse_complement_all(fasta)
    write_out(reverse_comp_records, file_location)
    
# If this script is run from the command line then call the main function.
if __name__ == '__main__':
    main()
7 views0 comments

Recent Posts

See All