Alpha parameter doesn’t work on geom_rect!!! Sort of…

In the R package ggplot2, the alpha parameter sets the transparency of the fill colour in the geom_ functions.

However, with the function geom_rect it might not work as expected.

In my latest work, I tried to combine different geom functions, but I got stuck: as soon as I added geom_rect, everything else was covered.
Let’s see an example:

library("dplyr")
library("ggplot2")
df = 
  data.frame(
    x = c(rep(1, 25), rep(2, 25), rep(3, 25)),
    y = c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25)),
    classes = c(rep("A", 25), rep("B", 25), rep("C", 25)))

head(df)
#  x  y classes
#1 1 45       A
#2 1  4       A
#3 1 41       A
#4 1 32       A
#5 1  8       A
#6 1 14       A

If we plot the data using geom_jitter and geom_boxplot, we obtain the following plot:

ggplot(data = df,
       aes(x = x, y = y, colour = classes)) + 
  geom_jitter() +
  geom_boxplot(alpha = 0.2) +
  theme_minimal()

K-Means in R and Python

K-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (k) within the dataset, so this has to be provided before the algorithm is initialised.

The basic idea of the K-means algorithm is that data points in the same group will gather near each other on the plane. Consequently, points that are close together are likely to belong to the same cluster.

The K-means algorithm generally runs faster than a hierarchical clustering algorithm (HC): each iteration of the standard algorithm takes time linear in the number of points n, whereas HC methods have a time complexity between O(n^{3}) and O(n^{2}\log n).

On the other hand, HC tends to give better-quality results than K-means.

In general, K-means is a good choice for large datasets and HC for small ones.

The algorithm

Given a set of n points, where each point is a d-dimensional vector, the algorithm separates the n points into k sets (with k \leq n), forming clusters S = \{S_{1}, S_{2}, \dots, S_{k}\} with centers \left( \mu_{1}, \dots, \mu_{k} \right). The Within-Cluster Sum of Squares (WCSS) is defined as:

\[ \mathrm{WCSS} = \min \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left\| x_{j}-\mu_{i}\right\|^{2} \]
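
To make the formula more concrete, here is a minimal sketch in R that computes the inner double sum by hand for a given clustering and compares it with the value reported by the base-R kmeans() function. The helper name wcss and the toy data are assumptions made purely for illustration; they are not part of the original post.

```r
# Hypothetical helper: sum of squared distances between each point
# and the centre (mean) of the cluster it is assigned to.
wcss <- function(points, cluster) {
  points <- as.matrix(points)
  total <- 0
  for (i in unique(cluster)) {
    members <- points[cluster == i, , drop = FALSE]  # points in S_i
    centre <- colMeans(members)                      # mu_i
    total <- total + sum(sweep(members, 2, centre)^2)
  }
  total
}

# Toy data (assumption): two well-separated groups of 20 points in 2 dimensions.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(toy, centers = 2)

# Both lines should report the same quantity.
wcss(toy, fit$cluster)
fit$tot.withinss
```

K-means searches for the assignment of points to clusters that makes this quantity as small as possible, which is why the formula is written with a minimum.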

The algorithm is composed of four steps:


How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely across all fields of research, not least bioinformatics.

The SVM classification method generally classifies well, and it is flexible enough to be used with a wide range of data.

Languages such as R and Python offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to classify a dataset in R and Python using some basic code.

For this example we will use the Iris dataset.
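
As a rough sketch of what such a classification can look like in R, the snippet below fits an SVM on the Iris dataset with the e1071 package. The choice of e1071, the radial kernel, and the 100/50 train/test split are assumptions made for illustration; the full post may well use something different.

```r
library(e1071)  # assumption: one of several R packages that provide SVMs

# Split the built-in iris data into a training and a test set.
set.seed(42)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit a classifier that predicts Species from the four measurements.
model <- svm(Species ~ ., data = train, kernel = "radial")

# Predict the species of the held-out flowers.
pred <- predict(model, newdata = test)

# Confusion matrix and overall accuracy on the test set.
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(confusion)) / nrow(test)
print(confusion)
print(accuracy)
```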


Testing in R

To test a given function in R, I use the package ‘testthat’.

A very nice intro to the package is available here: https://journal.r-project.org/archive/2011/RJ-2011-002/RJ-2011-002.pdf

However, in order to write good quality code and to write it only once, you have to make a paradigm shift in the way you write:

How to write a test-oriented program

You might have a natural tendency to write the tests after your code.

However, this is not the best approach: afterwards, you might need to rewrite a big part of the code to make it more ‘testable’.

To avoid that, you need to write your tests before the program.

Admittedly, this is a bit harder to do, especially the first few times.

Therefore, to make it easier, follow these guidelines:

Guidelines

  • Independent files:
    • One for the program, one for the tests
  • Code style for the program:
    • Atomicity:
      • A single R file for each task/objective.
      • Single functions for each step of your algorithm. This will help later when testing.
    • A main function to be invoked. This will list all the functions (steps) of the program.
  • Tests, in and out:
    • Within the program file
      • Tests that define and check the inputs
      • Tests that define and check the outputs
    • Within the test file
      • A test with correct inputs
      • Wrong inputs
      • Exceptions
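
The guidelines above could be sketched roughly as follows with testthat. The function names (centre_values, scale_values, normalise) and the file names in the comments are hypothetical, chosen only for illustration; in a real project the two parts would live in separate files, and they are shown together here only to keep the example self-contained.

```r
library(testthat)

# --- Program file (hypothetical name: normalise.R) ---
# One small function per step, each checking its own inputs.
centre_values <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)  # input check
  x - mean(x)
}

scale_values <- function(x) {
  stopifnot(is.numeric(x))                 # input check
  s <- sd(x)
  if (is.na(s) || s == 0) stop("cannot scale a constant vector")
  x / s
}

# Main function: lists the steps of the program in order.
normalise <- function(x) {
  out <- scale_values(centre_values(x))
  stopifnot(length(out) == length(x))      # output check
  out
}

# --- Test file (hypothetical name: test-normalise.R) ---
test_that("correct input is normalised", {
  res <- normalise(c(1, 2, 3))
  expect_equal(mean(res), 0)
  expect_equal(sd(res), 1)
})

test_that("wrong input type is rejected", {
  expect_error(normalise("a"))
})

test_that("a constant vector raises an informative error", {
  expect_error(normalise(c(2, 2, 2)), "constant")
})
```

Because every step is its own function, each one can also be tested separately if a test on the main function fails.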

How to make a bar chart with ggplot2

The ggplot2 package is one of the most powerful resources for creating charts in R.

It does have a rather steep learning curve, which may discourage newcomers, but believe me, it is definitely worth it.

Here I want to show a couple of example bar charts:

 
# For these charts we use a data set of mice immunised with two different antigens, OVA and CFA,
# together with the time elapsed after vaccination.
# The control group has time zero because those mice were not immunised.

library(ggplot2)

time_days <- c(0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,7,7,7,7,7,14,14,14,14,14,14,60,60,60,60,60,60,60,60,60,60,60)
antigens <- c('Control','Control','Control','Control','Control','Control','Control','Control','Control','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA')
db <- data.frame(time_days, antigens) # The data set as a data.frame

# One bar per time point, coloured by time point
# (ggplot() + geom_bar() replaces the deprecated qplot())
ggplot2::ggplot(db, ggplot2::aes(x = factor(time_days), fill = factor(time_days))) +
  ggplot2::geom_bar() +
  ggplot2::theme_classic() +
  ggplot2::scale_fill_discrete(name = 'Time in days') +
  ggplot2::ggtitle('Group of mice per date of sacrifice') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Number of mice')


With the following result:

This plot shows the number of mice and the time at which they were sacrificed.

Now, if we want to see both the number of mice and the antigens used, we could do the following:

 
# Same bars, now stacked and coloured by antigen
# (ggplot() + geom_bar() replaces the deprecated qplot())
ggplot2::ggplot(db, ggplot2::aes(x = factor(time_days), fill = factor(antigens))) +
  ggplot2::geom_bar() +
  ggplot2::theme_classic() +
  ggplot2::ggtitle('Group of mice per date of sacrifice and antigens') +
  ggplot2::scale_fill_discrete(name = 'Antigens') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Amount')

With the following result:

How to use R to create LaTeX documents

At work I was asked to write R code that could automatically generate a LaTeX document.

This was great! LaTeX was the first language that I learnt and it has a special place in my nerd heart.

It is nice to see that it is possible to work with R for LaTeX and create a PDF document that changes according to the input given to the code.

All the code is present in my GitHub repository: https://github.com/MattiaCinelli/fromRtoLatex
