If you have spent more than five seconds on r/dataisbeautiful, you have probably encountered a Sankey plot. Everyone uses them to track their expenses, job searches, and all sorts of multi-step processes. Indeed, they are well suited to visualizing the progression of events and their outcomes. And in my opinion, they look great!
So let’s see how to make one in Python: Jupyter Notebook here
An example of a Sankey plot made with matplotlib, from the official website.
The above plot is probably closer to the original concept of a Sankey plot (originally invented in 1898), but it is not something I would use in a publication.
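For reference, here is a minimal sketch of a matplotlib-style Sankey using the built-in `matplotlib.sankey.Sankey` class. The flow values, labels, and output file name are made up for illustration (a hypothetical job search) and do not come from the official example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

# Hypothetical job-search numbers: positive flows enter, negative flows leave.
flows = [10, -6, -3, -1]
labels = ["Applications", "No reply", "Rejected", "Offers"]
# 0 = horizontal, 1 = leaves from the top, -1 = leaves from the bottom
orientations = [0, 1, -1, 0]

fig, ax = plt.subplots()
ax.axis("off")

# scale is chosen so that scale * (sum of inputs) is about 1
sankey = Sankey(ax=ax, scale=0.1)
sankey.add(flows=flows, labels=labels, orientations=orientations)
diagrams = sankey.finish()

fig.savefig("sankey_example.png")
```

The flows have to sum to zero (what goes in must come out), otherwise matplotlib warns that the diagram is not conserved.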
How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!
As a data scientist at Intellegens, I work on a plethora of projects for different industries, including materials, drug design, and chemicals. For one particular project I was in desperate need of weather data: things like temperature, humidity, and rainfall for given space-time coordinates (date, time, and GPS location). This sent me down a rabbit hole so deep that I decided to share it with you!
Weather Data
I thought that finding an API for this type of information was going to be easy. I didn’t expect weather data to be one of the most jealously guarded types of data.
If you search for “free weather API”, you will find plenty of similar websites offering different services, but few of them are actually free, and even when there is a free tier, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.
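As a rough sketch of what a CDS request for a given date, time, and GPS location could look like, the snippet below builds the request dictionary and hands it to the `cdsapi` client. The dataset name `reanalysis-era5-single-levels`, the variable names, and the request keys are assumptions based on the public ERA5 download form; the keys have changed between CDS versions, so check the form of the dataset you use. The helper names, the coordinates, and the target file name are hypothetical, and the download itself needs a CDS account with an API key in `~/.cdsapirc`.

```python
from datetime import datetime

ERA5_DATASET = "reanalysis-era5-single-levels"  # assumed dataset name


def build_era5_request(when, lat, lon, variables, half_box_deg=0.25):
    """Build a request for one hour in a small box around (lat, lon)."""
    return {
        "product_type": "reanalysis",
        "variable": list(variables),
        "year": f"{when.year:04d}",
        "month": f"{when.month:02d}",
        "day": f"{when.day:02d}",
        "time": f"{when.hour:02d}:00",
        # Bounding box in the order North, West, South, East
        "area": [lat + half_box_deg, lon - half_box_deg,
                 lat - half_box_deg, lon + half_box_deg],
        "format": "netcdf",
    }


def download_era5(request, target="era5_sample.nc"):
    """Send the request to the CDS (requires cdsapi and valid credentials)."""
    import cdsapi  # imported here so the request can be built without it

    client = cdsapi.Client()
    client.retrieve(ERA5_DATASET, request, target)
    return target


# Hypothetical example: midday on 1 June 2020, near Cambridge (UK)
request = build_era5_request(
    datetime(2020, 6, 1, 12), lat=52.2, lon=0.12,
    variables=["2m_temperature", "total_precipitation"],
)
print(request)
# download_era5(request)  # uncomment once ~/.cdsapirc is configured
```

Keeping the request builder separate from the download makes it easy to inspect what will be sent before spending time in the CDS queue.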
Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.
It is one of the main machine learning methods in use today, and it has spread widely across all fields of research, not least bioinformatics.
The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.
Languages such as R and Python offer several libraries for computing and working with SVMs in a simple and flexible way.
Let’s see how to classify a dataset in R and Python using some basic code.
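On the Python side, a minimal sketch of such a classification could look like the following. It assumes scikit-learn and uses the Iris dataset purely as an illustration; the linear kernel, the 70/30 split, and the random seed are arbitrary choices, not settings taken from the post.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the example dataset: 150 flowers, 4 features, 3 classes
iris = datasets.load_iris()

# Hold out 30% of the samples to evaluate the classifier
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.3, random_state=0, stratify=iris.target,
)

# Fit a support vector classifier with a linear kernel
classifier = SVC(kernel="linear", C=1.0)
classifier.fit(X_train, y_train)

# Fraction of held-out samples that are classified correctly
accuracy = classifier.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `kernel="linear"` for `"rbf"` or `"poly"` is the usual first thing to try when the classes are not linearly separable.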
Let’s see a simple way to compute a linear regression using Python.
[code language="python"]
import matplotlib.pyplot as plt  # To plot the graph
import pandas as pd
# db is assumed to be scikit-learn's datasets module, which provides load_iris()
from sklearn import datasets as db

# Import a dataset to use; in this case I chose the famous Iris dataset
data = db.load_iris()  # load once and reuse
iris = pd.DataFrame(data['data'],
                    columns=list(data['feature_names']),
                    index=list(data['target']))
[/code]
Let’s take two columns from the dataset and plot them:
Now, to compute the linear regression we need the scipy library:
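A minimal sketch of this step, assuming the two columns are petal length and petal width (the excerpt does not say which ones were used) and that the Iris data comes from scikit-learn. The variable names `length` and `width` and the output file name are chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Rebuild the Iris DataFrame so that this snippet is self-contained
data = datasets.load_iris()
iris = pd.DataFrame(data["data"], columns=list(data["feature_names"]))

# Assumed choice of columns: petal length and petal width, in cm
length = iris["petal length (cm)"]
width = iris["petal width (cm)"]

# Scatter plot of one column against the other
fig, ax = plt.subplots()
ax.scatter(length, width)
ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
fig.savefig("iris_scatter.png")
```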
[code language="python"]
from scipy import stats
# length and width are the two columns selected above
# Here we compute the linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(length, width)
[/code]
Not surprisingly, our R-squared value shows a really good fit:
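As a sketch of how that value can be obtained: `linregress` returns the correlation coefficient `r_value`, and squaring it gives R-squared. The snippet assumes the two columns are petal length and petal width from scikit-learn's copy of Iris, which the excerpt does not confirm.

```python
import pandas as pd
from scipy import stats
from sklearn import datasets

# Rebuild the data so that this snippet is self-contained
data = datasets.load_iris()
iris = pd.DataFrame(data["data"], columns=list(data["feature_names"]))

# Assumed choice of columns
length = iris["petal length (cm)"]
width = iris["petal width (cm)"]

slope, intercept, r_value, p_value, std_err = stats.linregress(length, width)

# R-squared is the square of the correlation coefficient
r_squared = r_value ** 2
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R-squared={r_squared:.3f}")
```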
The ggplot2 package is one of the most powerful resources for creating plots in R.
ggplot2 does have a rather steep learning curve that may discourage beginners, but believe me, it is well worth the effort.
Here I want to show a couple of examples of bar charts:
[code language="r"]
# For these plots we use a dataset of mice immunized with two different antigens, OVA and CFA,
# together with the time elapsed after vaccination.
# The control group has time zero because those mice were not immunized.
library(ggplot2)
time_days=c(0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,7,7,7,7,7,14,14,14,14,14,14,60,60,60,60,60,60,60,60,60,60,60)
antigens=c('Control','Control','Control','Control','Control','Control','Control','Control','Control','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA','CFA','OVA','OVA','OVA','CFA','CFA')
db <- data.frame(time_days,antigens) # The dataset as a data.frame
ggplot2::qplot( data = db, factor(time_days), fill = factor(time_days), geom = "bar" )+
ggplot2::theme_classic()+
ggplot2::scale_fill_discrete(name='Time in days') +
ggplot2::ggtitle('Group of mice per date of sacrifice') +
ggplot2::xlab('Time in days') +
ggplot2::ylab('Number of mice')
[/code]
With the following result:
This plot shows the number of mice and the time at which they were sacrificed.
Now, if we want to see both the number of mice and the antigens used, we can do the following:
[code language="r"]
ggplot2::qplot( data=db, geom="bar", factor(time_days), fill=factor(antigens) ) +
ggplot2::theme_classic() +
ggplot2::ggtitle('Group of mice per date of sacrifice and antigens')+
ggplot2::scale_fill_discrete(name='Antigens') +
ggplot2::xlab('Time in days') +
ggplot2::ylab('Amount')
[/code]
ggplot2 is an amazing tool for plotting in R. One of the latest features I discovered is the ability to label plots automatically. Let’s check it out: