Bayes’ theorem

Overview

Bayes’ theorem is today considered one of the main theorems in statistics, and one of the most applied formulae in science.

its importance grew steadily until the middle of the last century. And it is now considered essential in all statistics courses, and applied in almost every field of research and not least in Bioinformatics, where it has been applied extensively to the biological system analysis.

At first glance, Bayes’ theorem can seem confusing, counterintuitive, and hard to grasp. We know that, for many, statistics is not intuitive, as with other aspects of mathematics.

However, if we analyse the thought processes leading Bayes to his theorem, we see that these are natural and logical ways of thinking.

Continue reading “Bayes’ theorem”

K-Means in R and Python

K-means is one of the most popular unsupervised algorithm for cluster analysis.

It cannot determine the number of clusters (k) within the dataset, therefore this has to be provided prior the initialisation of the algorithm.

The basic idea of the K-mean algorithm is that data points within a same group will gather on the plane near each other. Consequently close points are likely to belong to the same cluster.

The K-means algorithm performs better as compared to a hierarchical algorithm (HC) and the execution takes less time, with a time complexity of O(n^{2}), lower than other HC methods that have a time complexity between O(n^{3}) and O(n^{2}\log n ).

On the other hand, HC provides good quality results in respect to K– means.

In general, a K-means algorithm is good for a large dataset and HC is good for small datasets.

The algorithm

Given a set of points, where each point is an n-dimensional vector, the algorithm is able to separate the n points into k sets (with K \leq n, forming a number of clusters S ={S_{1},S_{2},\ …\ ,S_{k}} with centers \left( \mu_{1},\ …\ , \mu_{k} \right) , the Within-Cluster Sum of Squares formula (WCSS) is defined as:.

\\WCSS = \min \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left| x_{j}-\mu_{i}\right| ^{2}\\

The algorithm is composed of four steps:

Continue reading “K-Means in R and Python”

Introduction to Support Vector Machine (2)

Follow this link for part one of this introduction.

Finding the best hyperplane

In our example, we have used two-dimensional data: length vs. width of flower petals.

Because the hyperspace is two dimensional, the hyperplane must be one dimension smaller, thus, a line.

The formula of a line is: y=ax+b, and this formula is not dissimilar to the general formula of the hyperplane, that is defined as:

w^{t}x=0

The left-hand side of equation can be considered as the inner product of two vectors. Indeed, when we are dealing with points in space, as in this case, it is useful to use the concept of vectors.

The introduction of the concept of vectors should not be a surprise, given the name of the subject. However, explaining the entire vector algebra and the mathematics would be needlessly long, Here, I am giving a minimal amount of information in order to understand the concept:

Definition: A vector is defined as any quantity with a magnitude and a direction.

In other words, a vector exists between the origin O(0,0) and a point in the space.

Continue reading “Introduction to Support Vector Machine (2)”

Introduction to Support Vector Machine (1)

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics. SVM can be used for regression and classification analysis.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data. Programming languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Brief history

The idea behind the Support Vector Machines (SVM) is to be able to find a separating line between two classes of elements as points in space.

Continue reading “Introduction to Support Vector Machine (1)”

How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.

Languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to create a classification of the database in R and Python using some basic code.

For this example we will use the Iris dataset.

Continue reading “How to do a simple SVM classification in R and Python”