This is part 2, for part 1 follow this link.

# Application on Biological sequences

As seen thus far, MC and HMM are powerful methods that can be used for a large variety of purposes. However, we use a special case of HMM named Profile HMM for the study of biological sequences. In the following section, my description of this system should explain the reasoning behind the use of Profile HMM.

# Analysis of a MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:

A | C | A | – | – | – | A | T | G |

T | C | A | A | C | T | A | T | C |

A | C | A | C | – | – | A | G | C |

A | G | A | – | – | – | A | T | C |

A | C | C | G | – | – | A | T | C |

It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

[AT][CG][AC][ACGT]^{*}A[TG][GC]Each position in the regular expression represents the nucleotides in the chain. Multiple options for each position are gathered in a bracket: thus, the first element could equally be an A or a T, the second one a C or G, and so on. The element indicated with a * represents a gap area: only the A is not bracketed, because it is the only possible option of that position.

The regular expression is useful because it allows us to spot the pattern of this family of sequences in a visual and simple compact view. However, the regular expression is not an adequate method when establishing whether other sequences are part of this family.

As an example, let us consider two new sequences 1 and 2:

- TGCA–AGG
- ACAC–ATC

Both sequences fit the regular expression given above and, based on that alone, they could be considered part of the family. However, we can see that the first sequence is formed by the nucleotides occurring the fewest times in the multiple sequence alignment, while the second is formed by those most common. Indeed, in the first position, the T is present only once in the multiple sequence alignment, while A in all other sequence, similarly for the in second position, the G only once and C, for all remaining sequences.

We need a way to measure the “distance” between a new sequence and the original set of family sequences. To solve this problem, we can use MC and HMM.

## Markov Chain

From the model in figure above we can start computing the probability of transition in accordance with the alignment seen previously. As such, we can begin calculating the probability of passing from the nucleotide A to G or from G to T, for example, and forming a transition matrix.

Using the transition matrix, it is possible to obtain the probability of a new sequence through the following relation:

P(seq_{test} \mid M_{T}) =\prod_{i,j=1}^{I,J}P_{ij}Where the probability of a sequence (with respect to a transition matrix (M_{t})) is given by the product of the transition probability from one state to another of the sequence.

The result obtained through this equation is an indication of how much a sequence can be considered part of a family: the higher the value obtained, the greater the extent to which a sequence can be considered part of the family. However, the transition matrix is being obtained regardless of the positions of the states within the sequences. This aspect is not negligible, particularly regarding the study of DNA and protein sequences, where undeniably the position of a nucleotide is more important than its numeric presence.

In order to adapt the Markov Model to the study of sequences, the concept of Profile HMM has been introduced.

## Profile HMM

The Profile HMM is a variation of the Markov Chain in which the position of a multiple sequence alignment become the states of the model; the transition matrix is the probability to pass from one state/position to the next [1]. In this way, the probability of emission for each state is introduced, and thus the probability to have a particular nucleotide on that state of the Markov Chain.

[AT][CG][AC][ACGT]^{*}A[TG][GC]We can rewrite the regular expression presented in the previous section (and above), delivering the assumption of the previous paragraph. Figure below clarifies this concept:

The figure above is a visual representation of the application of the HMM method to the case presented here. Each of the boxes is a state of the HMM, corresponding to the position of the multiple sequence alignment or to each group of the regular expression. Inside the boxes are the possible emissions of the state and the probability of each one, while the external arrows exemplify the probability of transition from one state to the other.

The state on the top of figure above is the only one with the property to stay still, rather than to move to a different state. This property can be used to model an insertion state.

To generalise the scheme even further, more states can be added with different properties until a Profile HMM is formed.

The general standard architecture of a Profile HMM is presented in the figure below:

In the figure above the Profile-HMM has three hidden states [2], plus a “Begin” and “End” state. The squares are the “matches” states, which represent the frequency (emission) of the amino acid or nucleotides.

The diamonds are called the “insert state”. They are used to model the gap insertions of the alignment. They are usually used to test sequences longer than the consensus sequences [2].

The circles are the “deletion states”. They play the role of the silent or null state. They do not match residues or gaps and are used to jump from one column to another. These are usually used for sequences shorter than the consensus sequences.

“Begin” and “End” states have been introduced to help the transaction at the beginning and the end of the Profile HMM. They do not produce any emissions.

The arrows represent the transition probabilities. In the figure above, nine arrows are presented in a module (M, I, D states). This configuration is named Plan 9.

Profile HMM are widely used in biological sequences analysis. They have been proven to be useful for protein classification, motif detection, multiple sequences alignment [3], and protein secondary structure prediction [4]. In general, all of them can be summarised in terms of their major uses: aligning a sequence to a profile and scoring a sequence against a profile. To evaluate a sequence against a profile and obtain a score, more efficient dynamic programming methods than the ones presented above are available. The Forward algorithm [5] identifies the likelihood of a sequence; the Viterbi algorithm [6] identifies the most probable “path” to generate a given sequence; and the Forward-Backward (Baum-Welch) algorithm, used also to train the Profile HMM.

### Bibliography

- An Introduction to Hidden Markov Models for Biological Sequence. Krogh. 1998
- Hidden Markov Models and their Applications in Biological Sequence Analysis. Yoon, Byung-Jun. 2009.
- Hidden Markov models of biological primary sequence information. Baldi at all. 1994.
- Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins. Di Francesco. 1997.
- Statistical Methods for Speech Recognition. Jelinek. 1997
- Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. Viterbi. 1967