Bio-informatics differs from some of the other application areas discussed in this book in the sense that we are usually not interested in speeding up one large computation, but in getting better throughput of a large number of simpler computations. For instance, in gene alignment one has a large number of `reads' that need to be aligned independently against a reference genome. Of course we are still interested in speeding up this alignment, but to a large extend we are concerned with speeding up a bio-informatics pipeline of computations.
Another characteristic of bio-informatics computations is that many are not oriented to floating point operations. For instance, the biobench suite [biobench:ieee] contains the following programs:
Sequence similarity: Blast and Fasta
Phylogenetic analysis: Phylyp
Multiple sequence alignment: Clustal W
Sequence profile search: Hmmer
Genome-level alignment: Mummer
Sequence assembly: Tigr
Sequence alignment algorithms try to find similarities between DNA sequences.
In genome projects the complete genome of an organism is sequenced. This is often done by breaking the chromosomes into a large number of short pieces, which can be read by automated sequencing machines. The sequencing algorithms then reconstruct the full chromosome by finding the overlapping regions.
One popular gene alignment algorithm strategy is dynamic programming , which is used in the Needleman-Wunsch algorithm [NeedlemanWunsch] and variants such as the Smith-Waterman algorithm .
We formulate this abstractly. Suppose $\Sigma$ is an alphabet, and $A=a_0\ldots a_{m-1}$, $B=b_0\ldots b_{n-1}$ are words over this alphabet (if this terminology is strange to you, see appendix ), then we build a matrix $H_{ij}$ with $i\leq m, j\leq m$ of similarity scores as follows. \newcommand\wm{w_{\mathrm{match}}} \newcommand\ws{w_{\mathrm{mismatch}}} \newcommand\wdel{w_{\mathrm{deletion}}} We first define weights $\wm,\ws$, typically positive and zero or negative respectively, and a `gap scoring' weight function over $\Sigma\cup\{-\}$ \begin{equation} w(a,b)= \begin{cases} \wm&\hbox{if $a=b$}\\ \ws&\hbox{if $a\not=b$} \end{cases} \end{equation} Now we initialize \begin{equation} H_{i,*}\equiv H_{*,j}\equiv 0 \end{equation} and we inductively construct $H_{ij}$ for indices where $i>0$ or $j>0$.
Suppose $A,B$ have been matched up to $i,j$, that is, we have a score $h_{i',j'}$ for relating all subsequences $a_0\ldots a_{i'}$ to $b_0\ldots b_{j'}$ with $i'\leq i,j'\leq j$ except $i'=i,j'=j$, then:
If $a_i=b_j$, we set the score $h$ at $i,j$ to \begin{equation} h_{i,j} = h_{i-1,j-1}+\wm. \end{equation}
If $a_i\not=b_j$, meaning that the gene sequence was mutated, we set the score $h$ at $i,j$ to \begin{equation} h_{i,j} = h_{i-1,j-1}+\ws. \end{equation}
If $a_i$ was a deleted character in the $B$ sequence, we add $\wdel$ to the score at $i-1,j$: \begin{equation} h_{i,j} = h_{i-1,j}+\wdel. \end{equation}
If $b_j$ was a deleted character in the $A$ sequence, we add $\wdel$ to the score at $i,j-1$: \begin{equation} h_{i,j} = h_{i,j-1}+\wdel. \end{equation}
Data reuse
This algorithm has work proportional to $mn$, with only $m+n$ input,
and scalar output. This makes it a good candidate for implementation on
GPU
.
Data parallism
Typically, many fragments need to be aligned, and all these operations
are independent. This means that SIMD approaches, including on
GPU
, are feasible. If sequences are of unequal length, they can
be padded at a slight overhead cost.
Computational structure
In each row or column, the values of the $H$ matrix are defined recursively, so there is no obvious inner or outer loop that is parallel. However, the algorithm has a wavefront structure [Liu:cudasw2009] ; see figure and see section for a further discussion on wavefronts.
Assuming shared memory, we can split the $H$-matrix into blocks, and apply the wavefront algorithm to the blocks: the blocks on each diagonal can be processed in parallel. Inside each block, the diagonals can be processed with vector instructions .
Each diagonal only needs two previous diagonals for its computation, so the required amount of temporary space is linear in the input size.
A common task is to take a short read (50 to 1000 base pairs) and align it to a reference genome. Packages such as SOAP2 (`Short Oligonucleotide Alignment Program') [soap2:2009] use the Burrow Wheeler transform [BurrowWheeler:report] , which uses a data structure known as a trie or suffix tree .
http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
For a word over some alphabet, a suffix is a contiguous substring of the word that includes the last letter of the word. A suffix tree is a data structure that contains all suffices (of a word?). There are algorithms for constructing and storing such trees in linear time, and searching with time linear in the lenght of the search string. This is based on `edge-label compression', where a suffix is stored by the indices of its first and last character.
Example. The word `mississippi' contains the letters i,m,p,s , so on the first level of the tree we need to match on these. \begin{equation} \begin{array} {*{12}{c}} i&m&p&s\\ \end{array} \end{equation} The `i' can be followed by `p' or `s', so on the next level we need to match on that. \begin{equation} \begin{array} {*{12}{c}} i& &m&p&s\\ p&s\\ \end{array} \end{equation} However, after `ip' is only one possibility, `ippi', and similarly after `is' the string `issi' uniquely follows before we have another choice between `p' or `s': \begin{equation} \begin{array} {*{12}{c}} i & & &m&p&s\\ ppi&ssi\\ &ppi&ssippi\\ \end{array} \end{equation} After we construct the full suffix tree we have a data structure in which the time for finding a string of length $m$ takes time $O(m)$. Constructing the suffix tree for a string of length $n$ can be done in time $O(n)$.
Both genome alignment and signature selection can be done with suffix trees.