Hidden Markov Model applied to biological sequence. Part 2

This is part 2; for part 1, follow this link.

Application on Biological sequences

As seen thus far, Markov Chains (MC) and Hidden Markov Models (HMM) are powerful methods that can be used for a wide variety of purposes. For the study of biological sequences, however, we use a special case of HMM called the Profile HMM. The following description of this system should explain the reasoning behind its use.

Analysis of an MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:


It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

[AT][CG][AC][ACGT]*A[TG][GC]
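As a minimal sketch, the family pattern can be tested against candidate sequences with Python's standard `re` module. The function name `belongs_to_family` and the candidate strings are hypothetical, chosen only for illustration; they are not the sequences of the alignment. The sketch also assumes ungapped sequences.

```python
import re

# Family pattern from the text, written as a standard regular expression.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def belongs_to_family(sequence):
    """Return True if the whole (ungapped) sequence matches the family pattern."""
    return FAMILY_PATTERN.fullmatch(sequence.upper()) is not None

# Hypothetical candidate sequences, for illustration only.
candidates = ["ACAATG", "TCAACTATC", "GCAATG", "ACAATA"]
results = {seq: belongs_to_family(seq) for seq in candidates}
print(results)
```

The match is all-or-nothing: the pattern accepts or rejects a sequence, but does not say how typical of the family a matching sequence is.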

Hidden Markov Model applied to biological sequence. Part 1

Introduction on Markov Chains Models

Markov Chains (MC) [1][2] and Hidden Markov Models (HMM) [3] are powerful statistical models that can be applied in a variety of fields, such as protein homology detection [4], speech recognition [5], language processing [6], telecommunications [7], and tracking animal behaviour [8][9].

The HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically DNA sequences [10], either for their classification [11] or for the detection of specific regions within a sequence, most notably CpG islands [12].


Markov Chain models can be applied to any situation in which the history of previous events is known, whether those events are directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.

Markov Chain models are discrete dynamical systems with a finite number of states, in which transitions from one state to another follow a probabilistic model rather than a deterministic one. It follows that the information for a generic state X of a chain at time t is expressed by the probabilities of transition from time t-1.
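To illustrate how transition probabilities can be measured from an observed history, here is a minimal sketch that estimates first-order transition probabilities by counting adjacent pairs of symbols. The function name `transition_probabilities` and the toy sequence are assumptions made for the example, not part of the original text.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate P(state at t | state at t-1) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for previous, current in zip(sequence, sequence[1:]):
        counts[previous][current] += 1

    probabilities = {}
    for previous, following in counts.items():
        total = sum(following.values())
        probabilities[previous] = {
            state: n / total for state, n in following.items()
        }
    return probabilities

# Toy DNA sequence, for illustration only.
probs = transition_probabilities("ACGCGCGTA")
for previous, following in sorted(probs.items()):
    print(previous, following)
```

Each inner dictionary sums to 1, since it is the distribution of the next state given the current one.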


Bag-of-words and k-mers

Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

This method was first used in text and image classification, but it has recently been introduced into bioinformatics, where it can be successfully applied to repertoires of DNA/AA sequences.

With the BOW approach, we can reduce a highly complex document, such as a picture or a book, to a smaller set of low-level features called codewords. The set of codewords is known as a codebook, dictionary or vocabulary.

The kind and origin of the features are arbitrary. For example, if we want to analyse a book or a series of books, we can choose as features all the words in the books, or the letters, or combinations of letters. As for the origin, the features can be all the words present in those books, or all the words in the English dictionary, etc. As a result, the length of a codebook is defined by the number of features chosen.

  1. Tom likes to go to the cinema on Sunday.
  2. Martin likes to go to the cinema on Tuesday.
  3. George prefers to visit science museums.
  4. Carl prefers to visit art museums.
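As a minimal sketch, the four sentences above can be turned into count vectors by taking every distinct word they contain as the codebook. The tokenisation rule (lowercase, letters only) and the names `tokenize` and `bow_vector` are assumptions made for this example.

```python
import re
from collections import Counter

documents = [
    "Tom likes to go to the cinema on Sunday.",
    "Martin likes to go to the cinema on Tuesday.",
    "George prefers to visit science museums.",
    "Carl prefers to visit art museums.",
]

def tokenize(text):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

# Codebook: every distinct word found in the documents, in a fixed order.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

def bow_vector(text):
    """Count how many times each codeword occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector whose length equals the size of the codebook, so documents of different lengths can be compared numerically.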