## General Project Overview

In this Kaggle competition, a set of 5,863 chest X-ray images (anterior-posterior) was selected from retrospective cohorts of paediatric patients, aged one to five years, in Guangzhou. All chest X-ray imaging was performed as part of the patient’s routine clinical care.

All chest radiographs were screened by two expert physicians for quality control, and all low-quality or unreadable scans were removed.

The picture below shows the three types of chest X-ray present in the database:

On the left-hand side is an image of a healthy individual with clear lungs and no areas of abnormal opacification. The middle and right-hand images show patients affected by bacterial and viral pneumonia respectively. The latter presents a more diffuse “interstitial” pattern in both lungs, while the former typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows).

For this particular challenge, we are asked only to distinguish healthy chest X-rays from pneumonia-affected ones.


## How Probability Calibration Works

Probability calibration is the process of calibrating an ML model to return the true likelihood of an event. This is necessary when we need the probability of the event in question rather than its classification.

Imagine that you have two models to predict rainy days, Model A and Model B. Both models have an accuracy of 0.8. And indeed, for every 10 rainy days, both mislabelled two days.
But if we look at the probability connected to each prediction, we can see that Model A reports a probability of 80%, while Model B reports 100%.

This means that Model B is 100% sure that it will rain, even when it will not, while Model A is only 80% sure. It appears that Model B is overconfident in its predictions, while Model A is more cautious.

And it’s this level of confidence in predictions that makes Model A more reliable than Model B; Model A is better despite the two models having the same accuracy.

Model B offers a more yes-or-no prediction, while Model A tells us the true likelihood of the event. And in real life, when we look at the weather forecast, we get the prediction and its probability, leaving us to decide if, for example, a 30% risk of rain is acceptable or not.
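To put a number on that difference, here is a minimal sketch. It assumes ten days on which both models predict rain, eight of which turn out rainy (my reading of the example above), and it uses the Brier score, a metric not mentioned so far, as one possible way to measure how far the stated probabilities are from what actually happened. Lower is better.

```python
def brier_score(probabilities, outcomes):
    """Mean squared gap between predicted probability and outcome (1 = rain)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Assumed scenario: 10 days predicted as rainy, 8 of them really were.
outcomes = [1] * 8 + [0] * 2

model_a = [0.8] * 10  # Model A says "80% chance of rain" every time
model_b = [1.0] * 10  # Model B says "100% chance of rain" every time

score_a = brier_score(model_a, outcomes)
score_b = brier_score(model_b, outcomes)

print(f"Model A: {score_a:.2f}")
print(f"Model B: {score_b:.2f}")
```

Both models have the same accuracy, yet Model A gets the lower (better) score, because its 80% matches how often it actually rained.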


## Logging in Python

You know when you have coded your biggest project and, every time it runs, you can barely figure out what it is doing, relying only on a series of print statements and a few strategically saved files?

Well, if that is the case, you ought to learn logging and step up your game.

With a proper system of logging, you will have a consistent, ordered and more reliable way to understand your own code, to time and track its progression, and to capture bugs easily.

Let’s break down the advantages of logging:

1. Formatting: Logging allows you to standardize every message using a format of your choosing.
2. Time tracking: Alongside the message you can add the time when it is generated.
3. Compact: All messages are gathered in files, so you don’t need to scroll up continuously.
4. Versatility: Print is not informative everywhere (e.g., an object without a `__str__` method prints only its default representation).
5. Flexibility: Logging lets you assign different levels of importance to your messages, so you can control what is shown.

With all of this, you won’t be the only one who can understand your code.

Let’s start!
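As a first taste, here is a minimal sketch using the standard `logging` module. The logger name, the message format and the helper `build_logger` are illustrative choices of mine, and the messages go to an in-memory buffer only to keep the example self-contained.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Return a logger that writes timestamped, formatted messages to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicated handlers if called twice
    logger.propagate = False  # keep messages out of the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("demo", buffer)

log.debug("Too detailed: hidden because the level is INFO")
log.info("Loading data")
log.warning("Missing values found")

print(buffer.getvalue())
```

Swapping `logging.StreamHandler(stream)` for `logging.FileHandler("run.log")` (a hypothetical file name) would gather the same messages in a file instead.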

## How to do a Sankey Plot in Python

If you have spent more than five seconds on r/dataisbeautiful/, you have probably encountered a Sankey plot. Everyone uses them to track their expenses, job searches and any other multi-step process. Indeed, they are very well suited to visualising the progression of events and their outcomes.
And in my opinion, they look great!

So, let’s see how to make one in Python:
Jupyter Notebook here

## In matplotlib

Personally, I think they look awful in matplotlib.

The above plot is probably closer to the original concept of a Sankey plot (invented in 1898), but it is not something I would use in a publication.

The other solution is to use the library Plotly.

## In Plotly
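A minimal sketch of the data preparation, assuming a made-up job-search example: Plotly’s Sankey trace describes each flow by the integer positions of its source and target nodes in a list of labels, so the helper below (a hypothetical name, `to_sankey_arrays`) converts readable `(source, target, amount)` triples into that layout.

```python
def to_sankey_arrays(flows):
    """Turn (source, target, amount) triples into index-based node/link dicts."""
    labels = []   # node names, in order of first appearance
    index = {}    # node name -> position in `labels`
    source, target, value = [], [], []

    for src, dst, amount in flows:
        for name in (src, dst):
            if name not in index:
                index[name] = len(labels)
                labels.append(name)
        source.append(index[src])
        target.append(index[dst])
        value.append(amount)

    node = {"label": labels}
    link = {"source": source, "target": target, "value": value}
    return node, link


# Made-up numbers, for illustration only.
flows = [
    ("Applications", "Interviews", 10),
    ("Applications", "No reply", 30),
    ("Interviews", "Offers", 2),
    ("Interviews", "Rejections", 8),
]

node, link = to_sankey_arrays(flows)
print(node)
print(link)
```

With Plotly installed, the two dictionaries can be passed to `plotly.graph_objects.Sankey(node=node, link=link)` and wrapped in a `plotly.graph_objects.Figure` to draw the plot.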


## Create a weather forecast model with ML

How to create a simple weather forecast model using ML and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of different projects for different industries including materials, drug design, and chemicals. For one particular project, I was in desperate need of weather data: I needed things like temperature, humidity and rainfall, given the space-time coordinates (date, time and GPS location). And this made me fall into a rabbit hole so deep that I decided to share it with you!

## Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t expect weather data to be one of the most jealously guarded types of data.

If you search for “free weather API”, you will see plenty of similar websites offering different services that are not actually free, and even when there is a free package, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.


## Testing in Python

Having seen how to test in R, let’s see how to do the same in Python:

## Writing a test-oriented program

Good practice demands that we try to write our tests before we code the program we intend to build.

Or, at least, we should try to write the code in a way that makes it easier to test later, fighting our natural tendency to write the code we desperately want to write first and the tests afterwards.
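As a minimal sketch of the test-first idea, using the standard `unittest` module and a hypothetical `mean` function of my own choosing: the tests come first and pin down the expected behaviour, and the function is written afterwards to satisfy them.

```python
import unittest


class TestMean(unittest.TestCase):
    """Tests written first: they state what `mean` must do."""

    def test_average_of_numbers(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


def mean(values):
    """Hypothetical function, written after the tests above."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


# Run the tests programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would usually live in their own file and be run from the command line with `python -m unittest`.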

To do that, follow these guidelines:

## Alpha parameter doesn’t work on geom_rect!!! Sort of…

The parameter alpha in the R package ggplot2 is used to set the transparency of the fill colour in the geom_ family of functions.

However, for the function geom_rect it might not work as expected.

In my latest work, I tried to combine different geom functions, but I got stuck: everything was covered up when I used geom_rect.
Let’s see an example:

If we plot the data using geom_jitter and geom_boxplot we obtain the plot:


# Application on Biological sequences

As seen thus far, MC and HMM are powerful methods that can be used for a large variety of purposes. However, we use a special case of HMM named Profile HMM for the study of biological sequences. In the following section, my description of this system should explain the reasoning behind the use of Profile HMM.

# Analysis of an MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:

It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

$[AT][CG][AC][ACGT]^{*}A[TG][GC]$
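The family pattern above can be tried out directly with Python’s `re` module. This is only a sketch: the candidate sequences are made up for illustration (they are not taken from the alignment), and gaps are assumed to have been removed before matching.

```python
import re

# The family pattern, written as a Python regular expression.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")


def belongs_to_family(sequence):
    """True if the whole (ungapped) sequence matches the family pattern."""
    return FAMILY_PATTERN.fullmatch(sequence) is not None


# Hypothetical candidate sequences.
candidates = ["ACAATG", "TCAACTATC", "TGCTAGG", "GGGGGG"]

for sequence in candidates:
    print(sequence, belongs_to_family(sequence))
```

Note that the regular expression only answers yes or no: it cannot say that one matching sequence is a more typical member of the family than another.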

## Introduction on Markov Chains Models

Markov Chains (MC) [1][2] and Hidden Markov Models (HMM) [3] are powerful statistical models that can be applied in a variety of fields, such as protein homology detection [4], speech recognition [5], language processing [6], telecommunications [7], and tracking animal behaviour [8][9].

HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically to DNA sequences [10], for their classification [11], or the detection of specific regions of the sequence, most notably the work done on CpG islands [12].

### Overview

The Markov Chain models can be applied to all situations in which the history of a previous event is known, whether directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events computed.

The Markov Chain models are discrete dynamical systems of finite states in which transitions from one state to another are based on a probabilistic model, rather than a deterministic one. It follows that the information for a generic state $X$ of a chain at time $t$ is expressed by the probabilities of transition from time $t-1$.
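A minimal sketch of this idea, assuming a hypothetical two-state chain with made-up transition probabilities: the distribution over states at time $t$ is obtained from the distribution at time $t-1$ and the transition probabilities alone.

```python
def step(distribution, transition):
    """Advance the state distribution by one time step.

    distribution[i]  : probability of being in state i at time t-1
    transition[i][j] : probability of moving from state i to state j
    """
    n = len(transition)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


# Hypothetical chain: state 0 = "sunny", state 1 = "rainy".
transition = [
    [0.9, 0.1],  # from sunny: stay sunny, turn rainy
    [0.5, 0.5],  # from rainy: turn sunny, stay rainy
]

start = [1.0, 0.0]  # we know that today is sunny
after_one = step(start, transition)
after_two = step(after_one, transition)

print(after_one)
print(after_two)
```

Each row of the transition table sums to one, so the result of every step is again a valid probability distribution.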


## Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, and can be successfully applied to DNA/AA sequence repertoires.

With the BOW approach, we can redefine a highly complex document, such as a picture or a book, as a smaller set of low-level features called codewords (the set itself is also known as a codebook, dictionary or vocabulary).

The quality and origin of the features are arbitrary. For example, if we want to analyse a book or a series of books, we can choose as features all the words present inside the books, or the letters, or combinations of letters. As for the origin, the features can be all the words present in the same books, or all the words in the English dictionary, etc. As a result, the length of a codebook is defined by the number of features chosen.

1. Tom likes to go to the cinema on Sunday.
2. Martin likes to go to the cinema on Tuesday.
3. George prefers to visit science museums.
4. Carl prefers to visit art museums.
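Taking the four sentences above as documents, a minimal sketch of the BOW conversion could look like this. It assumes that the codebook is simply every distinct lowercase word found in the sentences, which is only one of the possible feature choices described earlier.

```python
import re
from collections import Counter

documents = [
    "Tom likes to go to the cinema on Sunday.",
    "Martin likes to go to the cinema on Tuesday.",
    "George prefers to visit science museums.",
    "Carl prefers to visit art museums.",
]


def tokenize(text):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


# Codebook: every distinct word in the documents, in alphabetical order.
codebook = sorted({word for doc in documents for word in tokenize(doc)})


def to_vector(text):
    """Count how many times each codeword occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in codebook]


vectors = [to_vector(doc) for doc in documents]

print(codebook)
for vector in vectors:
    print(vector)
```

Every sentence becomes a vector with one entry per codeword, so sentences of different lengths end up with the same number of dimensions and can be compared numerically.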