How to do a Sankey Plot in Python

If you have spent more than five seconds on r/dataisbeautiful, you have probably encountered a Sankey plot. People use them to track their expenses, job searches, and all kinds of multi-step processes. Indeed, they are well suited to visualizing the progression of events and their outcomes.
And in my opinion, they look great!

So, let’s see how to make one in Python:
Jupyter Notebook here

In matplotlib

Personally, I think they look awful in matplotlib.

An example of a Sankey plot made in matplotlib, from the official website.

The above plot is probably closer to the original concept of a Sankey plot (invented in 1898), but it is not something I would use in a publication.
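For reference, a matplotlib version can be sketched with the `matplotlib.sankey.Sankey` class. The flows and labels below are made-up illustrative values (a simple budget), not data from this post; inputs are positive, outputs are negative, and together they should sum to zero.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

# Hypothetical budget: one input (positive) split into three outputs (negative).
flows = [1.0, -0.6, -0.3, -0.1]
labels = ["Income", "Rent", "Food", "Savings"]
# 0 = horizontal, 1 = from/to the top, -1 = from/to the bottom
orientations = [0, 0, 1, -1]

fig, ax = plt.subplots()
ax.set_axis_off()
sankey = Sankey(ax=ax, unit=None)  # unit=None hides the quantities on the labels
sankey.add(flows=flows, labels=labels, orientations=orientations)
diagrams = sankey.finish()  # one entry per add() call
```

Calling `fig.savefig(...)` or `plt.show()` afterwards renders the diagram.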

The alternative is to use the Plotly library.

In Plotly

Therefore, without further ado:
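As a rough idea of what this looks like, here is a minimal sketch built around Plotly's `graph_objects.Sankey` trace, which expects node labels plus links given as integer indices into those labels. The helper names (`build_sankey_data`, `make_figure`) and the job-search numbers are hypothetical and purely illustrative; they are not taken from this post.

```python
def build_sankey_data(flows):
    """Convert (source, target, value) tuples into Plotly's node/link format."""
    labels = []
    for source, target, _ in flows:
        for name in (source, target):
            if name not in labels:
                labels.append(name)
    index = {name: i for i, name in enumerate(labels)}
    link = {
        "source": [index[s] for s, _, _ in flows],
        "target": [index[t] for _, t, _ in flows],
        "value": [v for _, _, v in flows],
    }
    return labels, link


def make_figure(flows):
    """Build a Plotly figure from the flows; requires the plotly package."""
    import plotly.graph_objects as go  # imported here so the helper above has no dependencies

    labels, link = build_sankey_data(flows)
    return go.Figure(go.Sankey(node={"label": labels}, link=link))


# Made-up job-search numbers, purely for illustration.
job_search = [
    ("Applications", "No reply", 30),
    ("Applications", "Rejected", 12),
    ("Applications", "Interview", 8),
    ("Interview", "Rejected", 5),
    ("Interview", "Offer", 3),
]

labels, link = build_sankey_data(job_search)
# make_figure(job_search).show()  # uncomment to render in a notebook or browser
```

Each distinct name becomes one node, so "Rejected" receives flows from both "Applications" and "Interview".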

Continue reading “How to do a Sankey Plot in Python”

Create a weather forecast model with ML

How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of projects for different industries, including materials, drug design, and chemicals. For one particular project, I was in desperate need of weather data: things like temperature, humidity, and rainfall, given the space-time coordinates (date, time, and GPS location). This made me fall into a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t expect weather data to be one of the most jealously guarded types of data.

If you search for “free weather API”, you will find plenty of similar websites offering different services, but they are not actually free, and even when there is a free tier, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.
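For what it's worth, requests to the CDS are usually made from Python through the `cdsapi` package. The sketch below only assembles a request for ERA5 hourly data. The dataset name, the request keys, the chosen variables, and the bounding box are assumptions for illustration and should be checked against the CDS documentation; the actual download also needs a CDS account and a configured API key.

```python
def build_era5_request(variables, year, month, day, hours, area):
    """Assemble a request dictionary for ERA5 hourly data.

    area is [north, west, south, east] in degrees.
    The key names are assumed from the CDS request format and may differ
    between CDS versions.
    """
    return {
        "product_type": "reanalysis",
        "variable": list(variables),
        "year": f"{year:04d}",
        "month": f"{month:02d}",
        "day": f"{day:02d}",
        "time": [f"{h:02d}:00" for h in hours],
        "area": list(area),
        "format": "netcdf",
    }


def download_era5(request, target="era5_sample.nc"):
    """Send the request to the CDS; needs the cdsapi package and an API key."""
    import cdsapi  # imported here so building a request works without it

    client = cdsapi.Client()
    # Dataset name assumed: ERA5 hourly data on single levels
    client.retrieve("reanalysis-era5-single-levels", request, target)
    return target


request = build_era5_request(
    variables=["2m_temperature", "total_precipitation"],
    year=2020, month=1, day=15,
    hours=[0, 6, 12, 18],
    area=[52.5, 0.0, 52.0, 0.5],  # hypothetical bounding box
)
# download_era5(request)  # uncomment once CDS credentials are configured
```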

Continue reading “Create a weather forecast model with ML”

Alpha parameter doesn’t work on geom_rect!!! Sort of…

The parameter alpha in the R package ggplot2 is used to express the transparency of the fill colour of the function geom_

However, for the function geom_rect it might not work as expected.

In my latest work, I tried to combine different geom functions, but I got stuck: everything else was covered up when I used geom_rect.
Let’s see an example:

If we plot the data using geom_jitter and geom_boxplot we obtain the plot:

Continue reading “Alpha parameter doesn’t work on geom_rect!!! Sort of…”

How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.

Languages such as R and Python offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to classify a dataset in R and Python using some basic code.

For this example we will use the Iris dataset.
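On the Python side, a minimal sketch with scikit-learn might look like the following. The linear kernel, the 70/30 train/test split, and the random seed are illustrative assumptions, not necessarily the choices made in the full post.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 flowers, 4 measurements each, 3 species
iris = datasets.load_iris()

# Hold out 30% of the samples to check the classifier on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a support vector classifier with a linear kernel
classifier = svm.SVC(kernel="linear")
classifier.fit(X_train, y_train)

# Fraction of held-out flowers assigned to the correct species
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```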

Continue reading “How to do a simple SVM classification in R and Python”

Simple linear regression in Python

Let’s see a simple way to compute a linear regression using Python.

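[code language="python"]
import matplotlib.pyplot as plt  # To plot the graph
import pandas as pd
# Assumption: `db` refers to scikit-learn's datasets module
from sklearn import datasets as db

# Load a dataset to work with; in this case I chose the famous Iris dataset
iris_data = db.load_iris()
iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
[/code]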

Let’s take two columns from the dataset and plot them:

[code language="python"]
length = iris['petal length (cm)']
width = iris['petal width (cm)']

plt.scatter(length, width, c=list(iris.index))
[/code]

Iris dataset, petal length vs. petal width

Now, to compute the linear regression we need the scipy library:

[code language="python"]
from scipy import stats

# Here we compute the linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(length, width)
[/code]
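To see what the slope and intercept represent, the ordinary least-squares line can also be computed by hand. This is a small sketch on made-up numbers (not the Iris data), chosen only to keep the arithmetic easy to follow; the variable names are hypothetical.

```python
import numpy as np

# Small made-up sample, not the Iris data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# slope = sum of products of deviations / sum of squared deviations of x
slope_manual = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line always passes through the point (x_mean, y_mean)
intercept_manual = y_mean - slope_manual * x_mean

print(slope_manual, intercept_manual)
```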

Not surprisingly, our R-squared value shows a really good fit:

[code language="python"]
r_value ** 2

# 0.9271098389904932
[/code]
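For a simple linear regression, the squared correlation coefficient is the same thing as the fraction of variance explained by the line, 1 - SS_res / SS_tot. A quick sketch on made-up numbers (again not the Iris data, with hypothetical variable names) shows the two agree:

```python
import numpy as np

# Small made-up sample, not the Iris data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares line and its predictions
fit_slope, fit_intercept = np.polyfit(x, y, 1)
predicted = fit_slope * x + fit_intercept

ss_res = np.sum((y - predicted) ** 2)  # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y
pearson_r = np.corrcoef(x, y)[0, 1]

print(r_squared, pearson_r ** 2)
```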

Let’s use the slope and intercept we got from the regression to plot predicted values vs. observed:

[code language="python"]
def predict(x):
    return slope * x + intercept

fitLine = predict(length)

plt.scatter(length, width)
plt.plot(length, fitLine, c='red')
[/code]

How to make a bar plot with ggplot2

The ggplot2 package is one of the most powerful resources for creating plots in R.

Although ggplot2 has a rather steep learning curve that might discourage beginners, believe me, it is definitely worth it.

Here I want to show a couple of bar plot examples:

# For these plots we have data on mice immunized with two different antigens, OVA and CFA,
# along with the time elapsed after vaccination.
# The control group has time zero because those mice were not immunized.

library(ggplot2)

time_days <- c(0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,7,7,7,7,7,14,14,14,14,14,14,60,60,60,60,60,60,60,60,60,60,60)
# `antigens` holds the antigen label for each mouse; its definition is not included here
db <- data.frame(time_days, antigens) # The dataset as a data.frame
ggplot2::qplot( data = db, factor(time_days), fill = factor(time_days), geom = "bar" ) +
  ggplot2::scale_fill_discrete(name='Time in days') +
  ggplot2::ggtitle('Group of mice per date of sacrifice') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Number of mice')

With the following result:

This plot shows the number of mice and the time at which they were sacrificed.
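The bar heights in this plot are simply the number of mice in each time group. As a cross-check outside R, the same counts can be tallied with a few lines of Python; this sketch only reuses the `time_days` values listed above.

```python
from collections import Counter

# Same values as the time_days vector used for the plot
time_days = [0, 0, 0, 0, 0, 0, 0, 0, 0,
             5, 5, 5, 5, 5, 5,
             7, 7, 7, 7, 7,
             14, 14, 14, 14, 14, 14,
             60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]

# One count per distinct time point, i.e. one bar per group
counts = Counter(time_days)
for day in sorted(counts):
    print(f"day {day}: {counts[day]} mice")
```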

Now, if we want to see both the number of mice and the antigens used, we can do the following:

ggplot2::qplot( data=db, geom="bar", factor(time_days), fill=factor(antigens) ) +
  ggplot2::theme_classic() +
  ggplot2::ggtitle('Group of mice per date of sacrifice and antigens') +
  ggplot2::scale_fill_discrete(name='Antigens') +
  ggplot2::xlab('Time in days') +
  ggplot2::ylab('Number of mice') # Assumption: same y label as in the first plot

With the following result:

Labelled plot in ggplot2

ggplot2 is an amazing tool for plotting in R. One of the latest features I discovered is the ability to label the plot automatically. Let’s check it out:

data <- data.frame(class=c('A','B','C'), 
                   value=c(50, 30, 20))

g <- ggplot2::ggplot(data = data,
                ggplot2::aes(x = class,
                             y = value)
                ) +
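  ggplot2::geom_bar(stat = "identity") +
  # Assumption: the labels come from geom_text, using `value` as the label text
  ggplot2::geom_text(ggplot2::aes(label = value),
                     size = 5)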

ggplot2::ggsave('labelled_plot.pdf', plot=g, device = 'pdf')

This would be the result:

Very easy and fast!

bar-plots using ggplot2

The package ggplot2 is one of the most powerful resources for plot making available in R.

Although it has quite a learning curve, which can be intimidating, it is definitely worth the effort.

Here I want to show a couple of the first bar plots I ever made with the ggplot2 package:

Continue reading “bar-plots using ggplot2”