How Probability Calibration Works

Probability calibration is the process of calibrating an ML model to return the true likelihood of an event. This is necessary when we need the probability of the event in question rather than its classification.

Image that you have two models to predict rainy days, Model A and Model B. Both models have an accuracy of 0.8. And indeed, for every 10 rainy days, both mislabelled two days.
But if we look at the probability connected to each prediction, we can see that Model A reports a probability of 80%, while Model B of 100%.

This means that model B is 100% sure that it will rain, even when it will not, while model A is only 80% sure. It appears that model B is overconfident with its prediction, while model A is more cautious.

And it’s this level of confidence in predictions that makes Model A a more reliable model with respect to Model B; Model A is better despite the two models having the same accuracy.

Model B offers a more yes-or-no prediction, while Model A tells us the true likelihood of the event. And in real life, when we look at the weather forecast, we get the prediction and its probability, leaving us to decide if, for example, a 30% risk of rain is acceptable or not.

Continue reading “How Probability Calibration Works”

Logging in Python

You know when you have coded your biggest project and every time it runs you can barely figure out what is doing, only by reading a series of print statements and the creation of strategically saved files?

Well if that is the case, you ought to learn logging and step up your game. 

With a proper system of logging. you will have a consistent, ordered and a more reliable way to understand your own code, to time and track its progression and capture bugs easily.

Let’s break down the advantages of logging:

  1. Formatting: Logging allows you to standardize every message using a format of your choosing.
  2. Time tracking: Alongside the message you can add the time when it is generated. 
  3. Compact: All messages are gathered in files, you don’t need to scroll up continuously. 
  4. Versatility: Print does not work everywhere (i.e., objects without __str__ methods).
  5. Flexibility: Logging allows different levels of importance to your messages so you regulate what to show.

With all of this, you won’t be the only one who can understand your code.

Let’s start!

Continue reading “Logging in Python”

How to do a Sankey Plot in Python

If you have been more than five seconds on r/dataisbeautiful/, you will have probably encountered a Sankey plot. Everyone uses to track their expenses, job searching and every multi step processes. Indeed, it is very suitable to visualize the progression of events and their outcome.
And in my opinion, they look great!

Therefore, let’s see how to do in Python:
Jupyter Notebook here

In matplotlib

Personally, in matplotlib they look awful.

An example of Sankey realized in matplotlib from the official website.

The above plot is probably closer to the original concept of a Sankey plot (originally invented in 1898), but it is not something I would use in a publication.

The other solution is to use the library Plotly.

In Plotly

Therefore, without further ado:

Continue reading “How to do a Sankey Plot in Python”