Computational biology

Experimental html version of downloadable textbook, see
\[ \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \newcommand\macro[1]{$\langle$#1$\rangle$} \] 14.1 : High performance computing in bio-informatics
14.2 : Smith-Waterman gene alignment
14.3 : Dynamic programming approaches
14.3.1 : Discussion
14.4 : Short read alignment
14.4.1 : Suffix tree
Back to Table of Contents

14 Computational biology

14.1 High performance computing in bio-informatics

crumb trail: > bio > High performance computing in bio-informatics

Bio-informatics differs from some of the other application areas discussed in this book in the sense that we are usually not interested in speeding up one large computation, but in getting better throughput of a large number of simpler computations. For instance, in gene alignment one has a large number of `reads' that need to be aligned independently against a reference genome. Of course we are still interested in speeding up this alignment, but to a large extend we are concerned with speeding up a bio-informatics pipeline of computations.

Another characteristic of bio-informatics computations is that many are not oriented to floating point operations. For instance, the biobench suite  [biobench:ieee] contains the following programs:

  • Sequence similarity: Blast and Fasta

  • Phylogenetic analysis: Phylyp
  • Multiple sequence alignment: Clustal W
  • Sequence profile search: Hmmer
  • Genome-level alignment: Mummer
  • Sequence assembly: Tigr

14.2 Smith-Waterman gene alignment

crumb trail: > bio > Smith-Waterman gene alignment

Sequence alignment algorithms try to find similarities between DNA sequences.

In genome projects the complete genome of an organism is sequenced. This is often done by breaking the chromosomes into a large number of short pieces, which can be read by automated sequencing machines. The sequencing algorithms then reconstruct the full chromosome by finding the overlapping regions.

14.3 Dynamic programming approaches

crumb trail: > bio > Dynamic programming approaches

One popular gene alignment algorithm strategy is dynamic programming , which is used in the Needleman-Wunsch algorithm   [NeedlemanWunsch] and variants such as the Smith-Waterman algorithm .

We formulate this abstractly. Suppose $\Sigma$ is an alphabet, and $A=a_0… a_{m-1}$, $B=b_0… b_{n-1}$ are words over this alphabet (if this terminology is strange to you, see appendix  \ref{app:fsa} ), then we build a matrix $H_{ij}$ with $i\leq m, j\leq m$ of similarity scores as follows.

\[ \newcommand\wm{w_{\mathrm{match}}} \newcommand\ws{w_{\mathrm{mismatch}}} \newcommand\wdel{w_{\mathrm{deletion}}} \]

We first define weights $\wm,\ws$, typically positive and zero or negative respectively, and a `gap scoring' weight function over $\Sigma\cup\{-\}$ \[ w(a,b)= \begin{cases} \wm&\hbox{if $a=b$}\\ \ws&\hbox{if $a\not=b$} \end{cases} \] Now we initialize \[ H_{i,*}\equiv H_{*,j}\equiv 0 \] and we inductively construct $H_{ij}$ for indices where $i>0$ or $j>0$.

Suppose $A,B$ have been matched up to $i,j$, that is, we have a score $h_{i',j'}$ for relating all subsequences $a_0… a_{i'}$ to $b_0… b_{j'}$ with $i'\leq i,j'\leq j$ except $i'=i,j'=j$, then:

  • If $a_i=b_j$, we set the score $h$ at $i,j$ to \[ h_{i,j} = h_{i-1,j-1}+\wm. \]

  • If $a_i\not=b_j$, meaning that the gene sequence was mutated, we set the score $h$ at $i,j$ to \[ h_{i,j} = h_{i-1,j-1}+\ws. \]
  • If $a_i$ was a deleted character in the $B$ sequence, we add $\wdel$ to the score at $i-1,j$: \[ h_{i,j} = h_{i-1,j}+\wdel. \]
  • If $b_j$ was a deleted character in the $A$ sequence, we add $\wdel$ to the score at $i,j-1$: \[ h_{i,j} = h_{i,j-1}+\wdel. \]

Summarizing: \[ H_{ij} = \max \begin{cases} 0\\ H_{i-1,j-1}+w(a_i,b_j)&\hbox{match/mismatch case}\\ H_{i-1,j}+\wdel &\hbox{deletion case $a_i$}\\ H_{i,j-1}+\wdel &\hbox{deletion case $b_j$}\\ \end{cases} \] This gives us a score at $h_{mn}$ and by backtracking we find how the sequences match up.

14.3.1 Discussion

crumb trail: > bio > Dynamic programming approaches > Discussion

Data reuse
This algorithm has work proportional to $mn$, with only $m+n$ input, and scalar output. This makes it a good candidate for implementation on GPUs .

Data parallism
Typically, many fragments need to be aligned, and all these operations are independent. This means that SIMD approaches, including on GPUs , are feasible. If sequences are of unequal length, they can be padded at a slight overhead cost.

Computational structure


Illustration of dependencies in the Smith-Watermann algorithm; diagonals are seen to be independent

In each row or column, the values of the $H$ matrix are defined recursively, so there is no obvious inner or outer loop that is parallel. However, the algorithm has a wavefront structure  [Liu:cudasw2009] ; see figure  labelstring and see section  0.1.1 for a further discussion on wavefronts.

Assuming shared memory, we can split the $H$-matrix into blocks, and apply the wavefront algorithm to the blocks: the blocks on each diagonal can be processed in parallel. Inside each block, the diagonals can be processed with vector instructions .

Each diagonal only needs two previous diagonals for its computation, so the required amount of temporary space is linear in the input size.

14.4 Short read alignment

crumb trail: > bio > Short read alignment

A common task is to take a short read (50 to 1000 base pairs) and align it to a reference genome. Packages such as SOAP2 (`Short Oligonucleotide Alignment Program')  [soap2:2009] use the Burrow Wheeler transform   [BurrowWheeler:report] , which uses a data structure known as a trie or suffix tree .

14.4.1 Suffix tree

crumb trail: > bio > Short read alignment > Suffix tree

For a word over some alphabet, a suffix is a contiguous substring of the word that includes the last letter of the word. A  suffix tree is a data structure that contains all suffices (of a word?). There are algorithms for constructing and storing such trees in linear time, and searching with time linear in the lenght of the search string. This is based on `edge-label compression', where a suffix is stored by the indices of its first and last character.

Example. The word `mississippi' contains the letters i,m,p,s , so on the first level of the tree we need to match on these. \[ \begin{array} {*{12}{c}} i&m&p&s\\ \end{array} \] The `i' can be followed by `p' or `s', so on the next level we need to match on that. \[ \begin{array} {*{12}{c}} i& &m&p&s\\ p&s\\ \end{array} \] However, after `ip' is only one possibility, `ippi', and similarly after `is' the string `issi' uniquely follows before we have another choice between `p' or `s': \[ \begin{array} {*{12}{c}} i & & &m&p&s\\ ppi&ssi\\ &ppi&ssippi\\ \end{array} \] After we construct the full suffix tree we have a data structure in which the time for finding a string of length $m$ takes time $O(m)$. Constructing the suffix tree for a string of length $n$ can be done in time $O(n)$.

Both genome alignment and signature selection can be done with suffix trees.

Back to Table of Contents