How Probability Calibration Works

Probability calibration is the process of adjusting an ML model so that it returns the true likelihood of an event. This is necessary when we need the probability of the event in question rather than just its predicted class.

Imagine that you have two models to predict rainy days, Model A and Model B. Both models have an accuracy of 0.8: for every 10 rainy days, both mislabel two.
But if we look at the probability attached to each prediction, we can see that Model A reports a probability of 80%, while Model B reports 100%.

This means that Model B is 100% sure that it will rain, even when it will not, while Model A is only 80% sure. Model B is overconfident in its predictions, while Model A is more cautious.

And it is this level of confidence in its predictions that makes Model A more reliable than Model B; Model A is better despite the two models having the same accuracy.

Model B offers a yes-or-no prediction, while Model A tells us the true likelihood of the event. And in real life, when we look at the weather forecast, we get both the prediction and its probability, leaving us to decide whether, for example, a 30% risk of rain is acceptable.
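To give a flavour of what calibration looks like in practice, here is a minimal sketch using scikit-learn’s CalibratedClassifierCV (the base classifier and the synthetic data are just example choices, not part of the article):

[code language="python"]
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# A toy binary problem standing in for "rain / no rain"
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap an uncalibrated classifier so that its scores become calibrated probabilities
calibrated = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

# Calibrated probabilities, e.g. P(rain), for the first few test samples
print(calibrated.predict_proba(X_test)[:5, 1])
[/code]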

Continue reading “How Probability Calibration Works”

Logging in Python

You know when you have coded your biggest project yet, and every time it runs you can barely figure out what it is doing, relying only on a series of print statements and strategically saved files?

Well, if that is the case, you ought to learn logging and step up your game.

With a proper logging system, you will have a consistent, ordered, and more reliable way to understand your own code, to time and track its progression, and to capture bugs easily.

Let’s break down the advantages of logging:

  1. Formatting: Logging allows you to standardize every message using a format of your choosing.
  2. Time tracking: Alongside each message you can record the time when it was generated.
  3. Compactness: All messages are gathered in files, so you don’t need to scroll up continuously.
  4. Versatility: Print does not work everywhere (e.g., objects without __str__ methods).
  5. Flexibility: Logging allows different levels of importance for your messages, so you can regulate what to show.

With all of this, you won’t be the only one who can understand your code.

Let’s start!
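As a first taste, here is a minimal sketch of the standard logging module (the format string and the log file name are just example choices):

[code language="python"]
import logging

# Send messages to a file, with a timestamp and a level in front of each one
logging.basicConfig(filename='my_project.log',
                    level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

logging.info('Pipeline started')
logging.warning('Missing values found, filling with zeros')
logging.error('Could not open the input file')
[/code]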

Continue reading “Logging in Python”

How to do a Sankey Plot in Python

If you have spent more than five seconds on r/dataisbeautiful/, you have probably encountered a Sankey plot. Everyone uses them to track their expenses, job searches, and all sorts of multi-step processes. Indeed, they are very well suited to visualizing the progression of events and their outcomes.
And in my opinion, they look great!

Therefore, let’s see how to make one in Python:
Jupyter Notebook here

In matplotlib

Personally, in matplotlib they look awful.

An example of a Sankey plot made in matplotlib, from the official website.

The above plot is probably closer to the original concept of a Sankey plot (invented in 1898), but it is not something I would use in a publication.

The other solution is to use the library Plotly.

In Plotly

Therefore, without further ado:
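Here is a minimal sketch with plotly.graph_objects (the node labels and flow values below are made up purely for illustration):

[code language="python"]
import plotly.graph_objects as go

# Nodes of the diagram and the flows (source -> target, with a value) between them
fig = go.Figure(go.Sankey(
    node=dict(label=['Income', 'Rent', 'Food', 'Savings']),
    link=dict(source=[0, 0, 0],   # indices into the node label list
              target=[1, 2, 3],
              value=[1200, 600, 400])))

fig.show()
[/code]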

Continue reading “How to do a Sankey Plot in Python”

Create a weather forecast model with ML

How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of different projects for different industries, including materials, drug design, and chemicals. For one particular project I was in desperate need of weather data: things like temperature, humidity, and rainfall, given the spacetime coordinates (date, time, and GPS location). And this made me fall down a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could provide this type of information was going to be easy. I didn’t foresee that weather data would be one of the most jealously guarded types of data.

If you search for “free weather API”, you will find plenty of similar websites offering different services that are not actually free, and even when there is a free tier, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.
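To give an idea of what accessing the CDS looks like, here is a minimal sketch with the cdsapi client (the variable, date, and bounding box below are just an example request, and a registered CDS API key in ~/.cdsapirc is assumed):

[code language="python"]
import cdsapi

c = cdsapi.Client()  # reads the API key from ~/.cdsapirc

# Request ERA5 2 m temperature for one day over a small bounding box
c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': '2m_temperature',
        'year': '2020',
        'month': '01',
        'day': '01',
        'time': ['00:00', '06:00', '12:00', '18:00'],
        'area': [53, -1, 52, 1],  # North, West, South, East
        'format': 'netcdf',
    },
    'era5_sample.nc')
[/code]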

Continue reading “Create a weather forecast model with ML”

Testing in Python

After having seen how to test in R, let’s see how to do the same in Python:

Writing a test-oriented program

Good practice demands that we try to write our tests before we code the program we intend to write.

Or at least, that we try to write the code in a way that is easier to test in the future, fighting our natural tendency to write the code we desperately want to write first and only then the tests.
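For instance, in a test-first spirit we might write something like the following before the function even exists (a minimal pytest sketch; the mystats module and the mean() function are hypothetical names used only for illustration):

[code language="python"]
# test_stats.py -- written before the code it exercises
import pytest

from mystats import mean   # hypothetical module we have not written yet


def test_mean_of_simple_list():
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_of_empty_list_raises():
    with pytest.raises(ValueError):
        mean([])
[/code]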

To do that, follow these guidelines:

Guidelines

Continue reading “Testing in Python”

K-Means in R and Python

K-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (k) within the dataset, so this has to be provided prior to the initialisation of the algorithm.

The basic idea of the K-means algorithm is that data points within the same group gather near each other in the feature space. Consequently, close points are likely to belong to the same cluster.

The K-means algorithm scales better than hierarchical clustering (HC) and takes less time to run, with a time complexity of O(n^{2}), lower than that of HC methods, which lies between O(n^{2}\log n) and O(n^{3}).

On the other hand, HC tends to provide better-quality results than K-means.

In general, K-means is a good choice for large datasets and HC for small ones.

The algorithm

Given a set of n points, where each point is a d-dimensional vector, the algorithm separates the n points into k sets (with k \leq n), forming clusters S = \{S_{1}, S_{2}, \dots, S_{k}\} with centers \left( \mu_{1}, \dots, \mu_{k} \right). The within-cluster sum of squares (WCSS) objective is defined as:

WCSS = \min_{S} \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left\| x_{j}-\mu_{i}\right\| ^{2}
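In practice we rarely implement this minimisation by hand; as a quick illustration, scikit-learn’s KMeans does it for us (a minimal sketch, with random 2-D data and k = 3 as example choices):

[code language="python"]
import numpy as np
from sklearn.cluster import KMeans

# Some random 2-dimensional points as a stand-in dataset
X = np.random.rand(300, 2)

# k has to be provided up front, as discussed above
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])   # cluster assignment of the first 10 points
print(km.inertia_)       # the WCSS value reached by the algorithm
[/code]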

The algorithm is composed of four steps:

Continue reading “K-Means in R and Python”

How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.

Languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to build a classifier in R and Python using some basic code.

For this example we will use the Iris dataset.
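In Python, for instance, the whole classification fits in a few lines with scikit-learn (a minimal sketch; the linear kernel and the 70/30 split are just example choices):

[code language="python"]
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset and split it into a training and a test set
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train an SVM with a linear kernel and check its accuracy on the held-out set
clf = SVC(kernel='linear').fit(X_train, y_train)
print(clf.score(X_test, y_test))
[/code]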

Continue reading “How to do a simple SVM classification in R and Python”

Simple linear regression in Python

Let’s see a simple way to compute a linear regression using Python.

[code language="python"]
import matplotlib.pyplot as plt       # To plot the graph
import pandas as pd
from sklearn import datasets as db    # Provides the example dataset

# Import a dataset to use; in this case I chose the famous Iris dataset
iris = pd.DataFrame(db.load_iris()['data'],
                    columns=list(db.load_iris()['feature_names']),
                    index=list(db.load_iris()['target']))
[/code]

Let’s take two columns from the dataset and plot them:

[code language="python"]
length=iris['petal length (cm)']
width=iris['petal width (cm)']

plt.scatter(length, width, c=list(iris.index))
plt.show()
[/code]
Iris dataset, petal length vs. petal width

Now, to compute the linear regression we need the SciPy library:

[code language="python"]
from scipy import stats

# Here we compute the linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(length, width)
[/code]

Not surprisingly, our R-squared value shows a really good fit:

[code language="python"]
r_value ** 2

# 0.9271098389904932
[/code]

Let’s use the slope and intercept we got from the regression to plot predicted values vs. observed:

[code language="python"]
def predict(x):
    return slope * x + intercept

fitLine = predict(length)

plt.scatter(length, width)
plt.plot(length, fitLine, c='red')
plt.show()
[/code]

Tutorial on Luigi, part 3 pipeline: input() and output()

In the last article we saw a small example of a Luigi pipeline. In this article I want to explore how to make the different Tasks communicate and pass information, such as a LocalTarget, between them.

We have already seen that we can use parameters to pass information from one Task to the next; another nice way is to use the methods input() and output().

The use of self.input()

Let’s see an example:

The value of self.input() comes from the result of the output() method of the Task returned by requires(); in this case, that would be MakeDirectory().
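To make that concrete, here is a minimal sketch of how a MakeDirectory task and a downstream task could be wired together (the task bodies and paths are illustrative and may differ from the ones used in the article):

[code language="python"]
import os

import luigi


class MakeDirectory(luigi.Task):
    path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.path)

    def run(self):
        os.makedirs(self.path)


class WriteFile(luigi.Task):
    path = luigi.Parameter()

    def requires(self):
        return MakeDirectory(path=os.path.dirname(self.path))

    def output(self):
        return luigi.LocalTarget(self.path)

    def run(self):
        # self.input() is the LocalTarget returned by MakeDirectory.output()
        with self.output().open('w') as f:
            f.write('created inside {}\n'.format(self.input().path))
[/code]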

Continue reading “Tutorial on Luigi, part 3 pipeline: input() and output()”