This is part 2; for part 1, follow this link.
Application to Biological Sequences
As seen thus far, MC and HMM are powerful methods that can be used for a large variety of purposes. For the study of biological sequences, however, we use a special case of HMM named Profile HMM. In the following section, I describe this system to explain the reasoning behind the use of Profile HMM.
Analysis of an MSA
Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently to identify other sequences that might belong to the same family.
We start by creating a multiple sequence alignment to highlight conserved positions:
It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:
Continue reading “Hidden Markov Model applied to biological sequence. Part 2”
Introduction to Markov Chain Models
Markov Chains (MC) and the Hidden Markov Model (HMM) are powerful statistical models that can be applied in a variety of different fields, such as protein homology detection, speech recognition, language processing, telecommunications, and tracking animal behaviour.
HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically DNA sequences, either for their classification or for the detection of specific regions of the sequence, most notably in the work done on CpG islands.
Markov Chain models can be applied to any situation in which the history of previous events is known, whether those events are directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.
Markov Chain models are discrete dynamical systems with a finite number of states, in which transitions from one state to another follow a probabilistic model rather than a deterministic one. It follows that the information for a generic state X of a chain at time t is expressed by the probabilities of transition from time t-1.
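To make the dependence on time t-1 concrete, here is a minimal Python sketch of a first-order Markov chain over the four DNA nucleotides. The transition and initial probabilities are made-up values chosen only for illustration, and the names `TRANSITIONS`, `INITIAL`, `sequence_probability` and `step` are hypothetical helpers, not part of any library.

```python
# Hypothetical transition probabilities P(next | current); each row sums to 1.
# These numbers are illustrative assumptions, not estimates from real data.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

# Hypothetical probabilities for the first symbol of a sequence.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def sequence_probability(seq):
    """Probability of observing seq under the chain defined above."""
    if not seq:
        return 1.0
    prob = INITIAL[seq[0]]
    # Each symbol depends only on the symbol immediately before it.
    for prev, curr in zip(seq, seq[1:]):
        prob *= TRANSITIONS[prev][curr]
    return prob


def step(distribution):
    """Advance a probability distribution over states from time t-1 to t."""
    return {
        nxt: sum(distribution[cur] * TRANSITIONS[cur][nxt] for cur in TRANSITIONS)
        for nxt in TRANSITIONS
    }


if __name__ == "__main__":
    print(sequence_probability("ACG"))
    print(step(INITIAL))
```

The `step` function is the part that mirrors the sentence above: the distribution at time t is computed from the distribution at time t-1 and the transition probabilities alone.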
Continue reading “Hidden Markov Model applied to biological sequence. Part 1”
Bag Of Words
The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.
At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, where it can be successfully applied to repertoires of DNA/AA sequences.
With the BOW approach, we can reduce a highly complex document, such as a picture or a book, to a smaller set of low-level features called codewords; the set itself is known as a codebook, dictionary or vocabulary.
The kind and origin of the features are arbitrary. For example, if we want to analyse a book or a series of books, we can choose as features all the words present inside the books, or the letters, or combinations of letters. As for the origin, the features can be all the words present in the same books, or all the words in the English dictionary, etc. As a result, the length of a codebook is defined by the number of features chosen.
Continue reading “Bag-of-words and k-mers”
- Tom likes to go to the cinema on Sunday.
- Martin likes to go to the cinema on Tuesday.
- George prefers to visit science museums.
- Carl prefers to visit art museums.
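As a rough illustration, the four sentences above can be turned into bag-of-words vectors with a few lines of Python. This sketch assumes one particular choice of features, namely the lowercased words found in the sentences themselves; other feature choices are equally valid. The names `tokenize`, `codebook` and `to_vector` are hypothetical helpers.

```python
from collections import Counter

sentences = [
    "Tom likes to go to the cinema on Sunday.",
    "Martin likes to go to the cinema on Tuesday.",
    "George prefers to visit science museums.",
    "Carl prefers to visit art museums.",
]


def tokenize(text):
    """Lowercase the text, drop full stops, and split on whitespace."""
    return text.lower().replace(".", "").split()


# Assumed feature choice: every distinct word seen in the sentences.
codebook = sorted({word for s in sentences for word in tokenize(s)})


def to_vector(text):
    """Count how often each codeword occurs in text, in codebook order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in codebook]


vectors = [to_vector(s) for s in sentences]

if __name__ == "__main__":
    print(codebook)
    for sentence, vector in zip(sentences, vectors):
        print(vector, sentence)
```

Each vector has the same length as the codebook, so the sentences can be compared numerically even though they contain different numbers of words.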