How Probability Calibration Works

Probability calibration is the process of adjusting an ML model so that the probabilities it returns reflect the true likelihood of an event. This is necessary when we need the probability of the event in question, rather than just its predicted class.

Imagine that you have two models to predict rainy days, Model A and Model B. Both models have an accuracy of 0.8: indeed, for every 10 rainy days, both mislabel two.
But if we look at the probability attached to each prediction, we can see that Model A reports a probability of 80%, while Model B reports 100%.

This means that Model B is 100% sure that it will rain, even when it will not, while Model A is only 80% sure. Model B appears overconfident in its predictions, while Model A is more cautious.

And it is this level of confidence in its predictions that makes Model A more reliable than Model B; Model A is the better model despite the two having the same accuracy.

Model B offers a more yes-or-no prediction, while Model A tells us the true likelihood of the event. And in real life, when we look at the weather forecast, we get the prediction and its probability, leaving us to decide if, for example, a 30% risk of rain is acceptable or not.
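
To make this concrete, here is a minimal sketch (my own illustration, not taken from the full post) of how calibration can be checked and fixed in scikit-learn: calibration_curve compares predicted probabilities with observed frequencies, and CalibratedClassifierCV recalibrates an overconfident model.

    # Minimal sketch: checking and fixing calibration with scikit-learn (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve

    X, y = make_classification(n_samples=5000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An uncalibrated model: its probabilities may be over- or under-confident.
    raw_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # The same model wrapped with isotonic calibration, fitted via cross-validation.
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method="isotonic", cv=5
    ).fit(X_train, y_train)

    for name, model in [("raw", raw_model), ("calibrated", calibrated)]:
        prob_pos = model.predict_proba(X_test)[:, 1]
        # Fraction of actual positives vs mean predicted probability per bin:
        # a well-calibrated model stays close to the diagonal.
        frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
        print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))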

Continue reading “How Probability Calibration Works”

Logging in Python

You know that feeling when you have finished coding your biggest project yet, and every time it runs you can barely figure out what it is doing, piecing it together from a series of print statements and strategically saved files?

Well, if that is the case, you ought to learn logging and step up your game.

With a proper logging system, you will have a consistent, ordered and more reliable way to understand your own code, to time and track its progress, and to capture bugs easily.

Let’s break down the advantages of logging:

  1. Formatting: Logging allows you to standardize every message using a format of your choosing.
  2. Time tracking: Alongside the message you can add the time when it is generated. 
  3. Compact: All messages are gathered in files, so you don’t need to scroll up through the console continuously.
  4. Versatility: Print does not work everywhere (e.g., on objects without __str__ methods).
  5. Flexibility: Logging lets you assign different levels of importance to your messages, so you can regulate what is shown.

With all of this, you won’t be the only one who can understand your code.
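
As a tiny teaser (a minimal sketch of my own, not the setup from the full post), this is roughly what points 1, 2, 3 and 5 look like in practice:

    import logging

    # Points 1 & 2: one standard format that stamps every message with time and level.
    logging.basicConfig(
        filename="analysis.log",   # point 3: everything is gathered in one file
        level=logging.INFO,        # point 5: only INFO and above get recorded
        format="%(asctime)s - %(levelname)s - %(message)s",
    )

    logger = logging.getLogger(__name__)

    logger.debug("Filtered out, because the level is set to INFO.")
    logger.info("Loading the dataset...")
    logger.warning("Column 'rainfall' has 3 missing values.")  # hypothetical message
    logger.error("Could not fit the model.")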

Let’s start!

Continue reading “Logging in Python”

How to do a Sankey Plot in Python

If you have spent more than five seconds on r/dataisbeautiful/, you have probably encountered a Sankey plot. Everyone uses them to track their expenses, their job search, and all sorts of multi-step processes. Indeed, they are very well suited to visualizing the progression of events and their outcomes.
And in my opinion, they look great!

Therefore, let’s see how to make one in Python:
Jupyter Notebook here

In matplotlib

Personally, I think they look awful in matplotlib.

An example of a Sankey plot made in matplotlib, from the official documentation.

The above plot is probably closer to the original concept of a Sankey plot (originally invented in 1898), but it is not something I would use in a publication.

The other solution is to use the library Plotly.

In Plotly

Therefore, without further ado:
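
Here is a minimal sketch of what a Sankey looks like in Plotly (the labels and values below are made up purely for illustration; the full walkthrough is in the notebook):

    import plotly.graph_objects as go

    # Hypothetical flow: job applications -> outcomes (values made up for illustration).
    labels = ["Applications", "Interview", "Rejected", "Offer", "No offer"]

    fig = go.Figure(go.Sankey(
        node=dict(label=labels, pad=15, thickness=20),
        link=dict(
            source=[0, 0, 1, 1],   # index into `labels` where each flow starts
            target=[1, 2, 3, 4],   # index into `labels` where each flow ends
            value=[30, 70, 10, 20],
        ),
    ))
    fig.update_layout(title_text="A minimal Sankey example", font_size=12)
    fig.show()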

Continue reading “How to do a Sankey Plot in Python”

Create a weather forecast model with ML

How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of different projects for different industries, including materials, drug design, and chemicals. For one particular project I was in desperate need of weather data: things like temperature, humidity, rainfall, etc., given the spacetime coordinates (date, time and GPS location). This sent me down a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t foresee that weather data would be one of the most jealously guarded types of data.

If you search for “free weather API”, you will find plenty of similar websites offering different services that are not actually free, and even when there is a free tier, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.
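
For the record, ERA5 data is downloaded from the CDS through its Python client, cdsapi. The sketch below shows the general shape of a request; the exact dataset and variable names should be checked against the CDS catalogue, and you need a (free) account with an API key stored in ~/.cdsapirc.

    import cdsapi

    c = cdsapi.Client()  # reads the API key from ~/.cdsapirc

    # Sketch of an ERA5 request: hourly surface variables for one day over a small area.
    c.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": ["2m_temperature", "total_precipitation"],
            "year": "2020",
            "month": "01",
            "day": "15",
            "time": ["00:00", "06:00", "12:00", "18:00"],
            "area": [53, -1, 52, 1],  # North, West, South, East (roughly Cambridgeshire)
            "format": "netcdf",
        },
        "era5_sample.nc",
    )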

Continue reading “Create a weather forecast model with ML”

Testing in Python

After having seen how to test in R, let’s see how to do the same in Python:

Writing a test-oriented program

Good practice demands that we try to write our tests before we code the program we intend to write.

Or at least, we should try to write the code in a way that will be easier to test in the future, fighting the natural tendency to first write the code we desperately want to write and only then the tests.
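
As a small sketch of that workflow (my own example, not from the post): first write the test for the function you wish you had, then write just enough code to make it pass, for instance with pytest.

    # test_rainfall.py -- written BEFORE the implementation (hypothetical example)
    from rainfall import mean_rainfall

    def test_mean_rainfall():
        assert mean_rainfall([0.0, 2.0, 4.0]) == 2.0

    def test_mean_rainfall_empty_list():
        assert mean_rainfall([]) == 0.0

    # rainfall.py -- the minimal implementation that makes the tests pass
    def mean_rainfall(values):
        """Average rainfall over a list of measurements; 0.0 for an empty list."""
        return sum(values) / len(values) if values else 0.0

Running pytest in the same directory then picks up every file named test_*.py automatically.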

To do that, follow these guidelines:

Guidelines

Continue reading “Testing in Python”

Alpha parameter doesn’t work on geom_rect!!! Sort of…

The parameter alpha in the R package ggplot2 is used to express the transparency of the fill colour in the various geom_ functions.

However, for the function geom_rect it might not work as expected.

In my latest work, I tried to combine different geom functions, but I got stuck because everything was covered up whenever I used geom_rect.
Let’s see an example:

If we plot the data using geom_jitter and geom_boxplot, we obtain this plot:

Continue reading “Alpha parameter doesn’t work on geom_rect!!! Sort of…”

Hidden Markov Model applied to biological sequence. Part 2

This is part 2, for part 1 follow this link.

Application on Biological sequences

As seen thus far, MC and HMM are powerful methods that can be used for a large variety of purposes. However, we use a special case of HMM named Profile HMM for the study of biological sequences. In the following section, my description of this system should explain the reasoning behind the use of Profile HMM.

Analysis of an MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:

ACAATG
TCAACTATC
ACACAGC
AGAATC
ACCGATC

It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

[AT][CG][AC][ACGT]*A[TG][GC]

Continue reading “Hidden Markov Model applied to biological sequence. Part 2”

Hidden Markov Model applied to biological sequence. Part 1

Introduction on Markov Chains Models

The Markov Chains (MC) [1][2] and the Hidden Markov Model (HMM) [3] are powerful statistical models that can be applied in a variety of different fields, such as: protein homologies detection [4]; speech recognition [5]; language processing [6]; telecommunications [7]; and tracking animal behaviour [8][9].

HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically to DNA sequences [10], for their classification [11] or the detection of specific regions of the sequence, most notably the work done on CpG islands [12].

Overview

Markov Chain models can be applied to all situations in which the history of previous events is known, whether those events are directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.

Markov Chain models are discrete dynamical systems with a finite number of states, in which transitions from one state to another are based on a probabilistic model rather than a deterministic one. It follows that the information for a generic state X of the chain at time t is expressed by the transition probabilities from time t-1.
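
In the usual notation (a standard formulation, added here only for clarity), this memoryless property reads:

    P(X_t = x \mid X_{t-1}, X_{t-2}, \dots, X_1) = P(X_t = x \mid X_{t-1})

that is, once the state at time t-1 is known, the earlier history adds no further information about the state at time t.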

Continue reading “Hidden Markov Model applied to biological sequence. Part 1”

Bag-of-words and k-mers

Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, where it can be successfully applied to repertoires of DNA and amino-acid (AA) sequences.

With the BOW approach, we can reduce a highly complex document, such as a picture or a book, to a smaller set of low-level features, called codewords (their collection is also called a codebook, dictionary or vocabulary).

The nature and origin of the features are arbitrary: if we want to analyse a book or a series of books, we can choose as features all the words in the books, or the letters, or combinations of letters. As for their origin, the features can be all the words present in those same books, or all the words in the English dictionary, etc. As a result, the length of the codebook is defined by the number of features chosen.

  1. Tom likes to go to the cinema on Sunday.
  2. Martin likes to go to the cinema on Tuesday.
  3. George prefers to visit science museums.
  4. Carl prefers to visit art museums.
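
To make the idea concrete, here is a minimal sketch (my own, not from the post) that builds the codebook and the count vectors for these four sentences in plain Python:

    from collections import Counter

    sentences = [
        "Tom likes to go to the cinema on Sunday",
        "Martin likes to go to the cinema on Tuesday",
        "George prefers to visit science museums",
        "Carl prefers to visit art museums",
    ]

    # The codebook: every distinct word in the corpus (granularity and origin are our choice).
    tokenised = [s.lower().split() for s in sentences]
    codebook = sorted({word for tokens in tokenised for word in tokens})

    # Each sentence becomes a vector of word counts over the codebook.
    vectors = [[Counter(tokens)[word] for word in codebook] for tokens in tokenised]

    print(codebook)
    for sentence, vector in zip(sentences, vectors):
        print(vector, "<-", sentence)
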
Continue reading “Bag-of-words and k-mers”

Bayes’ theorem

Overview

Bayes’ theorem is today considered one of the main theorems in statistics, and one of the most applied formulae in science.

Its importance grew steadily until the middle of the last century, and it is now considered essential in all statistics courses and applied in almost every field of research, not least in bioinformatics, where it has been used extensively in the analysis of biological systems.

At first glance, Bayes’ theorem can seem confusing, counterintuitive, and hard to grasp. We know that, for many, statistics is not intuitive, as with other aspects of mathematics.

However, if we analyse the thought processes leading Bayes to his theorem, we see that these are natural and logical ways of thinking.
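
For reference, the theorem itself, in its most common form, is simply:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

the probability of A given B, obtained from the probability of B given A, the prior probability of A, and the overall probability of B.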

Continue reading “Bayes’ theorem”