Computational biology

\[ \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \] 11.1 : High performance computing in bio-informatics
11.2 : Smith-Waterman gene alignment
11.3 : Dynamic programming approaches
11.3.1 : Discussion
11.4 : Short read alignment
11.4.1 : Suffix tree
Back to Table of Contents

11 Computational biology

11.1 High performance computing in bio-informatics

Top > High performance computing in bio-informatics

Bio-informatics differs from some of the other application areas discussed in this book in the sense that we are usually not interested in speeding up one large computation, but in getting better throughput of a large number of simpler computations. For instance, in gene alignment one has a large number of `reads' that need to be aligned independently against a reference genome. Of course we are still interested in speeding up this alignment, but to a large extend we are concerned with speeding up a bio-informatics pipeline of computations.

Another characteristic of bio-informatics computations is that many are not oriented to floating point operations. For instance, the biobench suite  [biobench:ieee] contains the following programs:

11.2 Smith-Waterman gene alignment

Top > Smith-Waterman gene alignment

Sequence alignment algorithms try to find similarities between DNA sequences.

In genome projects the complete genome of an organism is sequenced. This is often done by breaking the chromosomes into a large number of short pieces, which can be read by automated sequencing machines. The sequencing algorithms then reconstruct the full chromosome by finding the overlapping regions.

11.3 Dynamic programming approaches

Top > Dynamic programming approaches

One popular gene alignment algorithm strategy is dynamic programming , which is used in the Needleman-Wunsch algorithm   [NeedlemanWunsch] and variants such as the Smith-Waterman algorithm .

We formulate this abstractly. Suppose $\Sigma$ is an alphabet, and $A=a_0\ldots a_{m-1}$, $B=b_0\ldots b_{n-1}$ are words over this alphabet (if this terminology is strange to you, see appendix  ), then we build a matrix $H_{ij}$ with $i\leq m, j\leq m$ of similarity scores as follows. \newcommand\wm{w_{\mathrm{match}}} \newcommand\ws{w_{\mathrm{mismatch}}} \newcommand\wdel{w_{\mathrm{deletion}}} We first define weights $\wm,\ws$, typically positive and zero or negative respectively, and a `gap scoring' weight function over $\Sigma\cup\{-\}$ \begin{equation} w(a,b)= \begin{cases} \wm&\hbox{if $a=b$}\\ \ws&\hbox{if $a\not=b$} \end{cases} \end{equation} Now we initialize \begin{equation} H_{i,*}\equiv H_{*,j}\equiv 0 \end{equation} and we inductively construct $H_{ij}$ for indices where $i>0$ or $j>0$.

Suppose $A,B$ have been matched up to $i,j$, that is, we have a score $h_{i',j'}$ for relating all subsequences $a_0\ldots a_{i'}$ to $b_0\ldots b_{j'}$ with $i'\leq i,j'\leq j$ except $i'=i,j'=j$, then:

Summarizing: \begin{equation} H_{ij} = \max \begin{cases} 0\\ H_{i-1,j-1}+w(a_i,b_j)&\hbox{match/mismatch case}\\ H_{i-1,j}+\wdel &\hbox{deletion case $a_i$}\\ H_{i,j-1}+\wdel &\hbox{deletion case $b_j$}\\ \end{cases} \end{equation} This gives us a score at $h_{mn}$ and by backtracking we find how the sequences match up.

11.3.1 Discussion

Top > Dynamic programming approaches > Discussion

Data reuse
This algorithm has work proportional to $mn$, with only $m+n$ input, and scalar output. This makes it a good candidate for implementation on GPU .

Data parallism
Typically, many fragments need to be aligned, and all these operations are independent. This means that SIMD approaches, including on GPU , are feasible. If sequences are of unequal length, they can be padded at a slight overhead cost.

Computational structure

\includegraphics{graphics/smith-watermann-diagonal}

Illustration of dependencies in the Smith-Watermann algorithm; diagonals are seen to be independent

In each row or column, the values of the $H$ matrix are defined recursively, so there is no obvious inner or outer loop that is parallel. However, the algorithm has a wavefront structure  [Liu:cudasw2009] ; see figure  and see section  for a further discussion on wavefronts.

Assuming shared memory, we can split the $H$-matrix into blocks, and apply the wavefront algorithm to the blocks: the blocks on each diagonal can be processed in parallel. Inside each block, the diagonals can be processed with vector instructions .

Each diagonal only needs two previous diagonals for its computation, so the required amount of temporary space is linear in the input size.

11.4 Short read alignment

Top > Short read alignment

A common task is to take a short read (50 to 1000 base pairs) and align it to a reference genome. Packages such as SOAP2 (`Short Oligonucleotide Alignment Program')  [soap2:2009] use the Burrow Wheeler transform   [BurrowWheeler:report] , which uses a data structure known as a trie or suffix tree .

11.4.1 Suffix tree

Top > Short read alignment > Suffix tree

http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml

For a word over some alphabet, a suffix is a contiguous substring of the word that includes the last letter of the word. A  suffix tree is a data structure that contains all suffices (of a word?). There are algorithms for constructing and storing such trees in linear time, and searching with time linear in the lenght of the search string. This is based on `edge-label compression', where a suffix is stored by the indices of its first and last character.

Example. The word `mississippi' contains the letters i,m,p,s , so on the first level of the tree we need to match on these. \begin{equation} \begin{array} {*{12}{c}} i&m&p&s\\ \end{array} \end{equation} The `i' can be followed by `p' or `s', so on the next level we need to match on that. \begin{equation} \begin{array} {*{12}{c}} i& &m&p&s\\ p&s\\ \end{array} \end{equation} However, after `ip' is only one possibility, `ippi', and similarly after `is' the string `issi' uniquely follows before we have another choice between `p' or `s': \begin{equation} \begin{array} {*{12}{c}} i & & &m&p&s\\ ppi&ssi\\ &ppi&ssippi\\ \end{array} \end{equation} After we construct the full suffix tree we have a data structure in which the time for finding a string of length $m$ takes time $O(m)$. Constructing the suffix tree for a string of length $n$ can be done in time $O(n)$.

Both genome alignment and signature selection can be done with suffix trees.

Back to Table of Contents