Hidden Markov Model applied to biological sequence. Part 2

This is part 2; for part 1, follow this link.

Application to biological sequences

As seen thus far, Markov chains (MC) and hidden Markov models (HMM) are powerful methods that can be used for a wide variety of purposes. For the study of biological sequences, however, we use a special case of HMM named Profile HMM. The description of this system in the following section should explain the reasoning behind its use.

Analysis of an MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:

ACAATG
TCAACTATC
ACACAGC
AGAATC
ACCGATC

It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

[AT][CG][AC][ACGT]*A[TG][GC]
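As a quick check, the family pattern can be tested against the five sequences with Python's standard `re` module. This is only a minimal sketch: the helper name `belongs_to_family` is made up for illustration, and the sequences are written without alignment gaps, as they appear above.

```python
import re

# Family pattern from the text; [ACGT]* stands for the variable-length insert region.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# The five sequences listed above (no gap characters).
sequences = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]


def belongs_to_family(sequence):
    """Return True when the whole sequence matches the family pattern."""
    return FAMILY_PATTERN.fullmatch(sequence) is not None


for seq in sequences:
    print(seq, belongs_to_family(seq))
```

Note that a regular expression only answers yes or no: it gives no score for how typical a matching sequence is of the family.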

Hidden Markov Model applied to biological sequence. Part 1

Introduction to Markov Chain Models

Markov chains (MC) [1][2] and hidden Markov models (HMM) [3] are powerful statistical models that can be applied in a variety of fields, such as protein homology detection [4], speech recognition [5], language processing [6], telecommunications [7], and tracking animal behaviour [8][9].

HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically DNA sequences [10], either for their classification [11] or for the detection of specific regions of the sequence, most notably in the work done on CpG islands [12].

Overview

Markov chain models can be applied to any situation in which the history of previous events is known, whether the states are directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.

Markov chain models are discrete dynamical systems with a finite number of states, in which transitions from one state to another follow a probabilistic model rather than a deterministic one. It follows that the information for a generic state X of a chain at time t is expressed by the probabilities of transition from the state at time t-1.
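The idea of measuring transition probabilities from an observed history can be sketched in a few lines of Python. This is a minimal first-order example, not the post's own code: the training string and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next state | current state) from the observed transitions."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }


def sequence_probability(sequence, transitions, initial):
    """Probability of a state path: initial probability times each transition."""
    probability = initial.get(sequence[0], 0.0)
    for current, following in zip(sequence, sequence[1:]):
        probability *= transitions.get(current, {}).get(following, 0.0)
    return probability


# Hypothetical observed history (a short DNA string used only as toy data).
history = "CGCGACGCGT"
transitions = transition_probabilities(history)

# Probability of the path C -> G -> C, assuming the chain starts in C.
path_probability = sequence_probability("CGC", transitions, {"C": 1.0})
print(transitions)
print(path_probability)
```

Each state depends only on the state at the previous time step, which is why a table of pairwise transition probabilities is all the model needs.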

Bag-of-words and k-mers

Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, where it can be successfully applied to repertoires of DNA or amino acid (AA) sequences.

With the BOW approach, we can reduce a highly complex document, such as a picture or a book, to a smaller set of low-level features called codewords. The full set of codewords is known as a codebook, dictionary or vocabulary.

The nature and origin of the features are arbitrary. For example, if we want to analyse a book or a series of books, we can choose as features all the words present in the books, or the letters, or combinations of letters. As for the origin, the features can be all the words present in those books, all the words in the English dictionary, and so on. As a result, the length of a codebook is defined by the number of features chosen.

  1. Tom likes to go to the cinema on Sunday.
  2. Martin likes to go to the cinema on Tuesday.
  3. George prefers to visit science museums.
  4. Carl prefers to visit art museums.
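
As a rough illustration, the four sentences above can be turned into count vectors with the Python standard library. This sketch assumes the codebook is simply every distinct lowercase word in the four sentences; the function names are made up for the example, and the k-mer helper treats overlapping substrings of a sequence as its "words", which is one common way of carrying the idea over to DNA.

```python
import re
from collections import Counter

sentences = [
    "Tom likes to go to the cinema on Sunday.",
    "Martin likes to go to the cinema on Tuesday.",
    "George prefers to visit science museums.",
    "Carl prefers to visit art museums.",
]


def tokenize(text):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


# Codebook: every distinct word found in the documents, in a fixed order.
codebook = sorted({word for s in sentences for word in tokenize(s)})


def bow_vector(text, codebook):
    """Count how often each codeword occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in codebook]


def kmers(sequence, k):
    """List the overlapping substrings of length k in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


vectors = [bow_vector(s, codebook) for s in sentences]
kmer_counts = Counter(kmers("ACGACG", 3))

print(codebook)
for vector in vectors:
    print(vector)
print(kmer_counts)
```

Every vector has the same length as the codebook, so documents of different lengths become directly comparable.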

Introduction to Stoicism – pamphlet

Not to merely know, but to live philosophically

A PDF version of this introduction can be found here:

“It isn’t the events themselves that disturb people, only their judgements about them”

Epictetus

“If it is not right don’t do it, if it is not true don’t say it”

Marcus Aurelius

“Through my efforts, I gain the serenity to accept the things I cannot change; courage to change the things I can, and the wisdom to know the difference”

Secular serenity prayer

What is Stoicism?

In today’s English, we use the word stoicism to mean the ability or predisposition of a person to endure pain or hardship without displaying feelings.

However, that is not Stoicism!

Stoicism is a philosophy, a school of thought founded in Athens about 2300 years ago by a man named Zeno of Citium. Zeno started his school by standing on a porch in the market and talking to anyone who happened by. The word for porch in Greek is stoa, and the followers of Zeno were known as Stoics.

Stoicism became the preeminent philosophy of ancient Greece and Rome. It penetrated all sectors and classes of society, to the point that two of the most important Stoic authors are the slave Epictetus and the emperor Marcus Aurelius.

Stoicism flourished for nearly 500 years, until the fall of the empire. It re-emerged occasionally in the work of philosophers and thinkers during the Renaissance, when people returned to reason to find answers about how to live.

However, only recently has it been rediscovered as a philosophy to live by!

Stoicism and the Art of Happiness

The Stoics lived a long time ago, but they had some startling insights into the human condition – insights which endure to this day.

In this meetup we’ll take an overview of Stoicism as a whole: from its foundation by Zeno in Greece in 301 BC, through its development over the following 500 years, to its modern return.

Wednesday, May 29, 2019, 7:30 PM

Station Tavern
2 Station Square Cambridge, GB

28 Members Went

Stoicism is a body of thought with a simple yet extraordinary goal: to provide a rational, healthy way of living in harmony with the nature of the universe, with respect for our relationships with each other, and with a sense of direction in life!

Bayes’ theorem

Overview

Bayes’ theorem is today considered one of the main theorems in statistics, and one of the most applied formulae in science.

Its importance grew steadily until the middle of the last century. It is now considered essential in all statistics courses and is applied in almost every field of research, not least in bioinformatics, where it has been used extensively in the analysis of biological systems.

At first glance, Bayes’ theorem can seem confusing, counterintuitive, and hard to grasp. We know that, for many, statistics is not intuitive, as with other aspects of mathematics.

However, if we analyse the thought processes leading Bayes to his theorem, we see that these are natural and logical ways of thinking.
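
A small worked example can make the reasoning concrete. The theorem states that P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to a diagnostic test; all the numbers (prevalence, sensitivity, false positive rate) are hypothetical values picked for illustration, not data from any study.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) by Bayes' theorem, where B is a positive test and A is having the condition."""
    # Total probability of a positive test: true positives plus false positives.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
prevalence = 0.01            # P(A): 1% of people have the condition
sensitivity = 0.99           # P(B|A): the test detects 99% of true cases
false_positive_rate = 0.05   # P(B|not A): 5% of healthy people test positive

probability_given_positive = posterior(prevalence, sensitivity, false_positive_rate)
print(f"P(condition | positive test) = {probability_given_positive:.3f}")
```

Even with a sensitive test, the posterior stays low here because the condition is rare, which is the kind of result that looks counterintuitive until the prior is taken into account.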

K-Means in R and Python

K-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (k) within the dataset; this therefore has to be provided before the algorithm is initialised.

The basic idea of the K-means algorithm is that data points belonging to the same group lie near each other in space. Consequently, close points are likely to belong to the same cluster.

The K-means algorithm generally runs faster than hierarchical clustering (HC). Each iteration of the standard algorithm costs O(nkd) for n points, k clusters and d dimensions, which is close to linear in n, whereas agglomerative HC methods typically have a time complexity between O(n^{2}\log n) and O(n^{3}).

On the other hand, HC often provides better-quality results than K-means.

In general, a K-means algorithm is good for a large dataset and HC is good for small datasets.

The algorithm

Given a set of n points (x_{1},\ …\ ,x_{n}), where each point is a d-dimensional vector, the algorithm separates the n points into k sets (with k \leq n), forming the clusters S = \{S_{1},S_{2},\ …\ ,S_{k}\} with centers \left( \mu_{1},\ …\ , \mu_{k} \right). The Within-Cluster Sum of Squares (WCSS) that the algorithm minimises is defined as:

\min_{S}\ WCSS = \min_{S} \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left\| x_{j}-\mu_{i}\right\|^{2}
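
The WCSS objective can be sketched in plain Python as follows. The loop is the standard Lloyd-style iteration (assign each point to its nearest center, then move each center to the mean of its cluster); it is assumed here for illustration and is not necessarily the exact procedure of the original post. The toy points and starting centers are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wcss(clusters, centers):
    """Within-Cluster Sum of Squares for a given partition and its centers."""
    return sum(
        squared_distance(x, mu)
        for cluster, mu in zip(clusters, centers)
        for x in cluster
    )


def assign(points, centers):
    """Put every point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for x in points:
        nearest = min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))
        clusters[nearest].append(x)
    return clusters


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centers, iterations=10):
    """Alternate assignment and center update for a fixed number of iterations."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(c) if c else mu for c, mu in zip(clusters, centers)]
    return assign(points, centers), centers


# Toy data: two visible groups, with deliberately poor starting centers (k = 2).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
start = [(1.0, 1.0), (2.0, 1.0)]

wcss_start = wcss(assign(points, start), start)
clusters, centers = kmeans(points, start)
wcss_final = wcss(clusters, centers)
print(wcss_start, wcss_final)
```

Comparing the two printed values shows the objective decreasing as the centers move towards the middle of each group.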

The algorithm is composed of four steps:

Introduction to Support Vector Machine (2)

Follow this link for part one of this introduction.

Finding the best hyperplane

In our example, we have used two-dimensional data: length vs. width of flower petals.

Because the space is two-dimensional, the hyperplane has one dimension fewer and is therefore a line.

The formula of a line is y=ax+b, which is not dissimilar to the general formula of the hyperplane, defined as:

w^{T}x=0

The left-hand side of the equation can be considered as the inner product of two vectors. Indeed, when we are dealing with points in space, as in this case, it is useful to use the concept of vectors.
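
To see how the line formula fits the hyperplane formula, one common convention (assumed here, since the excerpt does not show the derivation) is to write w = (b, a, -1) and to extend each point to x = (1, x, y). The inner product then equals b + ax - y, which is zero exactly when y = ax + b. The values of a and b below are arbitrary.

```python
def inner_product(w, x):
    """Sum of the pairwise products of two vectors of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))


def augment(point):
    """Prepend a constant 1 so that the intercept b becomes part of w."""
    return (1.0, *point)


# Hypothetical line y = 2x + 1, written as w = (b, a, -1).
a, b = 2.0, 1.0
w = (b, a, -1.0)

on_line = inner_product(w, augment((1.0, 3.0)))  # point on the line
side_one = inner_product(w, augment((2.0, 0.0)))  # point on one side of the line
side_two = inner_product(w, augment((0.0, 5.0)))  # point on the other side

print(on_line, side_one, side_two)
```

The sign of the inner product tells which side of the line a point falls on, which is what makes this form convenient for classification.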

The introduction of the concept of vectors should not be a surprise, given the name of the subject. However, explaining the whole of vector algebra and its mathematics would be needlessly long. Here, I give only the minimal amount of information needed to understand the concept:

Definition: A vector is defined as any quantity with a magnitude and a direction.

In other words, a vector exists between the origin O(0,0) and a point in the space.
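
As a small illustration of the definition, the magnitude and direction of a two-dimensional vector from the origin can be computed with the standard `math` module. The function names and the example point are arbitrary choices for this sketch.

```python
import math


def magnitude(vector):
    """Length of the vector from the origin O(0,0) to the point (x, y)."""
    x, y = vector
    return math.hypot(x, y)


def direction(vector):
    """Unit vector pointing the same way: each component divided by the magnitude."""
    length = magnitude(vector)
    return tuple(component / length for component in vector)


point = (3.0, 4.0)
length = magnitude(point)
unit = direction(point)
print(length, unit)
```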

Introduction to Support Vector Machine (1)

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely across all fields of research, not least bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data. Programming languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Brief history

The idea behind Support Vector Machines (SVM) is to find a separating line between two classes of elements represented as points in space.

How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.

Languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to classify a dataset in R and Python using some basic code.

For this example we will use the Iris dataset.
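
On the Python side, a minimal sketch might look like the following. It assumes that scikit-learn is installed; the linear kernel, the 70/30 split and the fixed random seed are illustrative choices made here, not settings taken from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris measurements (features) and species labels (targets).
iris = load_iris()

# Hold out 30% of the samples for testing; the seed only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a support vector classifier with a linear kernel.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Fraction of held-out flowers assigned to the correct species.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Other kernels, such as `"rbf"`, can be tried by changing the `kernel` argument of `SVC`.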
