K-means is one of the most popular unsupervised algorithms for cluster analysis.
It cannot determine the number of clusters (k) in the dataset, so this value has to be provided before the algorithm is initialised.
The basic idea of the K-means algorithm is that data points belonging to the same group lie near each other in the feature space. Consequently, points that are close together are likely to belong to the same cluster.
The K-means algorithm is faster than hierarchical clustering (HC): it has a time complexity of O(n^{2}), lower than that of HC methods, which ranges between O(n^{2}\log n) and O(n^{3}).
On the other hand, HC tends to provide better-quality results than K-means.
In general, K-means is a good choice for large datasets and HC is a good choice for small ones.
The algorithm
Given a set of n points, where each point is a d-dimensional vector, the algorithm partitions the n points into k sets (with k \leq n), forming the clusters S = \{S_{1}, S_{2}, \ldots, S_{k}\} with centers \left( \mu_{1}, \ldots, \mu_{k} \right). It seeks the partition that minimises the Within-Cluster Sum of Squares (WCSS), defined as:
\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in S_{i}} \left\| x - \mu_{i} \right\|^{2}
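To make the procedure concrete, here is a minimal Python sketch of the standard iterative scheme (often called Lloyd's algorithm): assign every point to its nearest center, recompute each center as the mean of its cluster, and repeat until the centers stop moving. The function names, the toy data and the random initialisation are assumptions made for illustration, not part of any particular library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(clusters, centers):
    """Within-Cluster Sum of Squares for the given clusters and centers."""
    return sum(
        squared_distance(p, c)
        for cluster, c in zip(clusters, centers)
        for p in cluster
    )


def k_means(points, k, max_iter=100, seed=0):
    """Illustrative K-means: returns (clusters, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random points
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: the assignment can no longer change
        centers = new_centers
    return clusters, centers


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centers = k_means(points, k=2)
print(centers, wcss(clusters, centers))
```

Note that k is passed in explicitly, which mirrors the point made earlier: the algorithm does not discover the number of clusters by itself.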
Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.
It is one of the main machine learning methods used in modern artificial intelligence, and it has spread widely across all fields of research, not least bioinformatics.
The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.
Languages such as R and Python offer several libraries for building and working with SVMs in a simple and flexible way.
Let’s see how to classify a dataset in R and Python using some basic code.
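As a dependency-free illustration of what an SVM classifier does, the following Python sketch trains a linear SVM by batch subgradient descent on the regularised hinge loss. It is only a toy under stated assumptions: the function names, the six hand-made samples and the hyperparameters (`lam`, `epochs`, `lr`) are all illustrative. Dedicated SVM libraries use more refined solvers and also support non-linear kernels.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent.

    Labels must be +1 or -1. Returns the weight vector w and the bias b.
    """
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regularisation term.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        # Subgradient of the hinge loss: only samples whose margin is
        # below 1 contribute.
        for x, y in zip(samples, labels):
            if y * decision_value(w, b, x) < 1:
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class label (+1 or -1) for a single sample."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical, linearly separable toy data with two features per sample.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0),
           (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
```

The learned `w` and `b` define the separating hyperplane; new samples are classified by the side of that hyperplane on which they fall.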
The ggplot2 package is one of the most powerful resources for creating plots in R.
ggplot2 does have a fairly steep learning curve that may discourage people who are just starting to use it, but believe me, it is definitely worth the effort.
Here I want to show a couple of examples of bar charts:
# For these plots we have data on a set of mice immunized with two different antigens, OVA and CFA,
# together with the time (in days) after vaccination.
# The control group has time zero because those mice were not immunized.
library(ggplot2)
time_days <- c(0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,7,7,7,7,7,14,14,14,14,14,14,60,60,60,60,60,60,60,60,60,60,60)
antigens <- c('Control','Control','Control','Control','Control','Control','Control','Control','Control','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA')
db <- data.frame(time_days, antigens) # The dataset as a data.frame
# One bar per time point; the bar height is the number of mice at that time
ggplot2::ggplot(db, ggplot2::aes(x = factor(time_days), fill = factor(time_days))) +
  ggplot2::geom_bar() +
  ggplot2::theme_classic() +
  ggplot2::scale_fill_discrete(name = 'Time in days') +
  ggplot2::ggtitle('Group of mice per date of sacrifice') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Number of mice')
This gives the following result:
This plot shows the number of mice at each time point at which they were sacrificed.
Now, if we want to see both the number of mice and the antigens used, we can do the following:
# Same bars, now stacked and coloured by antigen
ggplot2::ggplot(db, ggplot2::aes(x = factor(time_days), fill = factor(antigens))) +
  ggplot2::geom_bar() +
  ggplot2::theme_classic() +
  ggplot2::ggtitle('Group of mice per date of sacrifice and antigens') +
  ggplot2::scale_fill_discrete(name = 'Antigens') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Amount')
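The bar heights in both plots are simply group counts. As a cross-check outside R, the short Python sketch below tallies the same data using only the standard library. The variable names mirror the R vectors above, the lists are written compactly but contain the same 37 values in the same order, and the names `mice_per_day` and `mice_per_day_and_antigen` are made up for this illustration.

```python
from collections import Counter

# Same values as the R vectors time_days and antigens, in the same order.
time_days = [0] * 9 + [5] * 6 + [7] * 5 + [14] * 6 + [60] * 11
antigens = (
    ["Control"] * 9
    + ["OVA"] * 3 + ["CFA"] * 3                              # day 5
    + ["OVA"] * 3 + ["CFA"] * 2                              # day 7
    + ["OVA"] * 3 + ["CFA"] * 3                              # day 14
    + ["OVA"] * 3 + ["CFA"] * 3 + ["OVA"] * 3 + ["CFA"] * 2  # day 60
)

# Heights of the bars in the first plot: mice per time point.
mice_per_day = Counter(time_days)

# Heights of the stacked segments in the second plot: mice per (day, antigen).
mice_per_day_and_antigen = Counter(zip(time_days, antigens))

for day in sorted(mice_per_day):
    segments = {
        antigen: count
        for (d, antigen), count in mice_per_day_and_antigen.items()
        if d == day
    }
    print(day, mice_per_day[day], segments)
```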