Hidden Markov Model applied to biological sequence. Part 1

Introduction on Markov Chains Models

The Markov Chains (MC) [1][2] and the Hidden Markov Model (HMM) [3] are powerful statistical models that can be applied in a variety of different fields, such as: protein homologies detection [4]; speech recognition [5]; language processing [6]; telecommunications [7]; and tracking animal behaviour [8][9].

HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically to DNA sequences [10], for their classification [11], or the detection of specific regions of the sequence, most notably the work made on CpG islands [12].

Overview

The Markov Chain models can be applied to all situations in which the history of a previous event is known, whether directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.

The Markov Chain models are discrete dynamical systems of finite states in which transitions from one state to another are based on a probabilistic model, rather than a deterministic one. It follows that the information for a generic state X of a chain at the time t is expressed by the probabilities of transition from the time: t-1.

In HMM, the previous rules are observed, but the observer can only see the output of the function associated with each state — not the states directly. In other words, the observer can see the event produced at the time t but cannot observe the state that has produced the output.

General Theory

Markov Chain

A Markov Chain provides a theoretical model to describe the behaviour of a discrete-time system. Definition: For a finite state space S=\left\lbrace s_{1}, s_{2},\ ...\ ,s_{N} \right\rbrace with a sequence of random variables X_{1}, X_{2},\ ...\, X_{n}, assuming values in S for which the transitional probability P, of the state s_{j} at the time t, is given by the transitional from the state s_{i} and the time t-1, with probability p_{ij}, thus the probability of transition from state s_{i} to s{j} (Markov assumption). A representation of a Markov Chain is present below:

Representation of a Markov Chain with three states:} This plot shows a representation of a Markov Chain with three states. Each circle (A, B, and C) represents a state of this chain, and the arrows represent the probability of moving from one state to another, different one, or the probability of remaining the same (values near the arrow). This type of Markov Chain is used to describe the change of status in discrete phenomena: for example, the probability of having thymine after a guanine in a particular gene sequence.

In other words, the probability of a state S, at time, is given only by the immediately preceding state; therefore, all events before t-1 can be ignored. For this property, the Markov Chain is called memoryless.

The Markov assumption is also known as the Markov Chain of first-order. A Markov Chain of second-order would have the observation at time t depending on time t-1 and time t-2; however, Markov Chains of orders higher than 1 are rarely used.

As a consequence of the Markov assumption, the entire behaviour of a Markov Chain can be described using a transition matrix. The dimension of a transition matrix is the number of states of the Markov Chain, and all the elements describe the probability of moving from one state to another (or back to the same state):

q_{ij}=P \left ( q_{t}=S_{j} | q_{t-1}=S_{i} \right )\ 1\leq i,\ j\leq N

thus:

A=(a_{ij} )_{i=1,j=1}^{n,n}

Where the following property is true:

a_{ij}\geq 0,\ \sum_{j=1}^{j}a_{ij}=1

The Markov Chain can be described as a triple (S,X,P), a set of states S, with X random variables and a transition probability matrix P.

Example

Let us imagine two events A and B with the following transitional matrix:

ToAB
A0.90.2
B0.10.8

Every time the system is in the state A, there 90% probability that system will remain in state A for the next time. Every time the system moves from state A to B, it will stay in B for 80% of the times.

We can compute the probability of having the succession of events ABBAB, given a starting A, to be:

P=P(AA)P(AB)P(BB)P(BA)P(AB) = 0.9\cdot0.2\cdot0.8\cdot0.1\cdot0.2=0.00288=0.288\%

Hidden Markov Models

As seen so far, for the Markov Chain, each state corresponds to an observable event. However, this type of model appears to be too restrictive, and it cannot be applied to many situations of interest.

The situation might occur in which a stochastic process producing a set of outputs has an underlying process that is not observable, thus the `hidden’ label.

The Hidden Markov Models can be considered as a quintuple N, M, A, B, \mu with the following elements:

  • N, hidden states.
  • The observable symbol per state M. What type of output can come out from each state.
  • The transition probability matrix A={a_{ij}}.
  • The probability distribution of the observable output.
  • Probability for the initial state, \mu.

Transition matrices and emission probability are usually calculated with a pseudo-count [13][14]. This consists of adding a small quantity, typically 1, to all the states in all positions, before calculating the probabilities, to avoid zero-values that could compromise future results.

Application to the biological sequence in the next post.

Bibliography

  1. James R Norris. Markov chains. Cambridge University Press, 1998. Britannica. Markov process, 2016. Available at https://www.
  2. britannica.com/topic/Markov-process.
  3. Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The annals of mathematical statistics, pages 1554–1563, 1966.
  4. Kevin Karplus, Christian Barrett, and Richard Hughey. Hidden Markov Models for Detecting Remote Protein Homologies Santa Cruz , CA 95064 Abstract 1 Introduction 2 Test Sets. 1999.
  5. Lawrence R Rabiner. Readings in Speech Recognition. pages 267–296, 1990.
  6. Jayaweera and Dias. Hidden Markov Model Based Part of Speech Tagger for Sinhala Language. CoRR, abs/1407.2, 2014.
  7. Hans-Gunter Hirsch. HMM adaptation for applications in telecommunication. Speech Communication, 34(1-2):127–139, 2001.
  8. Selcuk Sandikci, Pinar Duygulu, and Bulent Ozgule. HMM Based Behavior Recognition of Laboratory Animals. 2012.
  9. I D Jonsen, M Basson, S Bestley, M V Bravington, T A Patterson, M W Pedersen, R Thomson, U H Thygesen, and S J Wotherspoon. State-space models for bio-loggers: A methodological road map. Deep Sea Research Part II: Topical Studies in Oceanography, 88–89:34–46, 2013.
  10. M J Bishop and E A Thompson. Maximum likelihood alignment of DNA sequences. Journal of molecular biology, 190(2):159–165, Jul 1986.
  11. Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, SeanREddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, Gustavo A Salazar, John G Tate, and Alex Bateman. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(Database-Issue):279–285, 2016.
  12. Hao Wu, Brian Caffo, Harris A Jaffee, Rafael AIrizarry and Andrew P Feinberg. Redefining CpG islands using hidden Markov models. Biostatistics, page kxq005, 2010.
  13. Anders Krogh. An Introduction to Hidden Markov Models for Biological Sequences. Computational Methods in Molecular Biology, 32:45–63, 1998.
  14. Wikipedia. Pseudocount, 2016.
0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments