Notes on BLAST and its automation in Python

BLAST is a classic set of sequence alignment algorithms that is a keystone in biology. If you are new to BLAST, here are some introductory notes I made for an undergraduate workshop. By classic, I meant that BLAST still relied on residue-to-residue comparisons, such as matching each base in the query to the subject sequence. Unlike in the olden days, we now battle titanic datasets so alignment-free methods are becoming more appropriate over time.

Source: Monnet C, Loux V, Gibrat J-F, Spinnler E, Barbe V, et al. (2010). doi:10.1371/journal.pone.0015489

Image licensed under the Creative Commons Attribution 2.5 Generic license.


The core algorithm of BLAST is written in C, one of the fastest programming languages. In terms of its algorithm, it was significantly more efficient than its predecessors, Needleman–Wunsch (1970) and Smith–Waterman (1981), which made BLAST the only one among the three viable for database searches. We can trade sensitivity for speed by employing larger word sizes, asking for fewer target alignments to be made and having a higher expect threshold.


Sequence alignment is bread and butter in biology, and I often find myself coming back to it in a manner that is more complex than before. I think it shows that BLAST is indeed a cornerstone of biology, and it is a tool that can be used in a myriad of ways creatively to solve more problems than one could imagine.


In my encounters with BLAST, here is what I learned that was interesting:


1. There are five main types of BLAST.


Legend

<-> Sequence alignment

=> Translation


Type | Query <-> Subject

BLASTN | DNA <-> DNA

BLASTP | Protein <-> Protein

BLASTX | DNA => Protein <-> Protein

TBLASTN | Protein <-> Protein <= DNA

TBLASTX | DNA => Protein <-> Protein <= DNA


The more complex and more uncommonly used BLAST are the last three. The simplest among them is BLASTX, which accepts DNA as input but translates it into six protein sequences (3 frames per DNA strand * 2 strands) which is then used to query the protein database. TBLASTN is the reverse of BLASTX.


TBLASTX is the most complicated as it takes DNA as input, translate it in six ways, do the same for the subject DNA sequences in the database, then perform 36 protein-protein comparisons. It is useful for finding homologous genes via the protein sequences, which is more conserved than DNA sequences (since the third nucleotide base in a codon is wobbly).


2. There are different flavours of BLASTP


PSI-BLAST is primarily used to find distantly related sequences and I explained how it works in more detail in my notes. PHI-BLAST searches the database for homologous sequences that has a user-specified protein signature. DELTA-BLAST searches the conserved domain database first and from there it derives a position-specific scoring matrix to search the sequence database for homologous sequences.


3. Scoring is arbitrary for DNA, but less so for protein sequences

There is no empirical way to set the scores for matches, mismatches, gaps and gap extensions in BLASTN.

In contrast, for BLASTP you could use pre-defined scoring matrices (e.g. BLOSUM62) to set the scores, which makes scoring less arbitrary.


The scores in these matrices were derived from how probable it is to change from one amino acid to another based on analyzing protein sequences in the database. When two residues are aligned, if they match, the scores are high. If both did not match, but their amino acids share similar properties (e.g. basic side chains, like Arg and Lys), then the scores will still be positive and vice versa.


BLOSUM62 Matrix


To look for distantly related sequences, or when your searches return too few results, you want more relaxed scores. In such cases, try lower BLOSUM matrices or higher PAM matrices (e.g. BLOSUM50 and PAM250). Alternatively, you want the searches to be more specific, use higher BLOSUM or lower PAM.


Interestingly, BLOSUM62 was miscalculated in the past. When it was fixed, the alignment results were worst off. This remains a mystery...


4. Understanding BLAST statistics

Statistical significance of the alignment is provided by the E value. I explained in my notes:

A general rule of thumb: the lower the E value, the better. In fact, the alignments in the description are ordered from the lowest to the highest E value so you see alignment with the highest statistical significance first.


How to read E value? If the E value was 1, then we’d expect to find 1 alignment with the same score by chance alone. You might ask: why not use p-value?


Take a look at this table:

E value p-value

10 0.999

5 0.993


Which is easier to read? Of course, it’s the E value. Rather than saying that the p-value is 0.999, we can say that there are 10 alignments with the same score as our alignment that could have occurred by chance alone.


But what exactly is the statistical significance here? The short answer: it is whether the subject sequence is homologous to your query sequence, or the alignment has occurred purely by chance (i.e. the two sequences are not related to each other evolutionarily at all).


This nuance may seem trivial but it can be important. Recently I was looking at the specificity of capture probes for NGS. Briefly, capture probes are complementary to genome target sequences and trap any sequence it can bind; later, we send the trapped DNA for sequencing. This meant that homology, and thus the blast statistics, have nothing to do with my blasting of probes against the human genome (the probes figuratively flicks their fingers at me and binds to any sequence it can by chance)! In such scenarios, I had to up the E value so that blast does not filter out any hits I might be interested in.


5. There is no position-specific scoring matrix (PSSM) for BLASTN

But we can overcome this by using HMMER instead of BLAST. To explain PSSMs simply, imagine sliding a window along the DNA sequence. This window is trained by aligning tons of the homologous sequences via multiple sequence alignment (MSA) you are interested in, so it marks important sites with high scores but the less important ones with negligible scores. As the window slides along the DNA, when a homologous sequence is reached, the window calls out a high score. If you look within the window at this point, this sequence might not look at all similar to any of the sequences you used to make the MSA.


This is the power of PSSMs as it is sensitive to only the matches of important positions. In other words, PSSMs pick up signals from the noise generated by evolution. In BLAST, PSSM is available for BLASTP in PSI-BLAST (see my notes especially if the quick explanation above did not help you understand PSSMs).


If you want to see HMMER in action, you can check out my project where I used a sequence from Arabidopsis thaliana to mine homologs in Brassica genomes.


6. You can run BLAST locally on command-line, and automate it using Biopython

It's quite simple, let me show you.


First, you can conveniently grab an outdated copy of BLAST+ package from Linux APT. The latest version can be downloaded here but you will need to install it via source code.

sudo apt update
sudo apt-get install ncbi-blast+

Second, if you have not installed biopython, do so:

sudo apt install python3-pip
pip install biopython

Third, make a blast database from fasta sequence. Here my fasta sequences are all DNA, so I use a nucleotide database:

makeblastdb \
-in my_seqs.fasta \
-input_type fasta \
-out ./blast_db/my_seqs \
-title my_seqs \
-parse_seqids \
-blastdb_version 5 \
-dbtype nucl

Fourth, in python I write a function required to run blast:

from Bio.Blast.Applications import NcbiblastnCommandline
    
def run_blast(query_seq, blast_db, threads):
    """
    Run blast client and output XML file in temp folder
    """
    # Determine which blast task to use by length of query seq
    if len(query_seq) <= 30:
        blast_mode = "blastn-short"
    else:
        blast_mode = "blastn"

    # BLAST set-up
    # Note max_target_seqs is 500 by default
    cline = NcbiblastnCommandline(
        db = blast_db,
        word_size = 11,              
        task = blast_mode,          # blastn/ blastn-short
        strand = 'both',            
        out = './output/blast.xml',        # Output file name
        outfmt = 5,                 # Output as XML format (outfmt=5)
        evalue = 0.05,              # E-value
        num_threads = threads)      # Number of threads to use

    # Run BLAST which output XML
    stdout, stderr = cline(stdin = query_seq)
    pass

Fifth, I write a function to read the blast results from the XML file produced by run_blast(). The codes below are adapted from Biopython's manual:

from Bio.Blast import NCBIXML

def parse_blast():
    """
    Reads blast XML output in temp folder and print results
    """
    # Read BLAST results 
    result_handle = open("./output/blast.xml")
    blast_record = NCBIXML.read(result_handle)

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.05:
                print("****Alignment****")
                print("sequence:", alignment.title)
                print("length:", alignment.length)
                print("e value:", hsp.expect)
                print(hsp.query[0:70] + "...")
                print(hsp.match[0:70] + "...")
                print(hsp.sbjct[0:70] + "...")
                
query_seq = 'tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatat'
blast_db = './blast_db/my_seqs'
threads = 4
run_blast(query_seq, blast_db, threads)
parse_blast()

The output looks like this:

****Alignment****
sequence: my_seq No definition line
length: 70
e value: 1.62585e-16
TACTTGTTGATATTGGATCGAACAAACTGGAGAACCAACATGCTCACGTCACTTTTAGTCCCTTACAT...
||||||||| | ||||||||||| || ||||  || || |||||||| |||||| |  | ||||||||...
TACTTGTTGGTGTTGGATCGAACCAATTGGAAGACGAATATGCTCACATCACTTCTCATTCCTTACAT...

I leave my codes here if you want to try it out.


You might be wondering where can we find more variables within the hsp class. You can find it in the Biopython manual, but I paste the unified model language (UML) below:




Note that the variables present here are not exhaustive. If you cannot find it in the UML, then I suggest looking at the source code. For instance, I wanted the end position for the subject HSP sequence which is actually found in the source code as 'hsp.sbjct_end'.


There is some odd behaviour I noticed in BLAST+: when the strand parameter is set to either 'plus' or 'minus', it failed to find me long sequences which can be found in the parameter was set to 'both'. This unexpected output is why we should always write test cases to ensure that our developed software works exactly as intended.


For ideas on what automated blast can do, here is my project where I used it to make software that design multiple primers for highly-homologous sequences in the same genome.


52 views0 comments

Recent Posts

See All