Create a weather forecast model with ML

How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of different projects for different industries including materials, drug design, and chemicals. For one particular project I was in desperate need of weather data: things like temperature, humidity, rainfall, etc., given the space-time coordinates (date, time and GPS location). This sent me down a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t foresee that weather data would be one of the most jealously guarded types of data.

If you search for “free weather API”, you will see plenty of similar websites with different services that are not actually free, and even when there is a free package, it will never include historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.
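Once you are registered on the CDS website and your API key is stored in ~/.cdsapirc, the cdsapi Python client lets you download ERA5 data programmatically. Below is a minimal sketch of such a request; the dataset name, variables, dates and bounding box are only illustrative, so check the exact names against the CDS documentation:

import cdsapi

# The client reads the CDS URL and API key from ~/.cdsapirc
c = cdsapi.Client()

# Illustrative request: hourly 2 m temperature and total precipitation
# for one day over a small bounding box (North, West, South, East).
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature", "total_precipitation"],
        "year": "2019",
        "month": "01",
        "day": "01",
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "area": [53, -1, 52, 1],
        "format": "netcdf",
    },
    "era5_sample.nc",
)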

Continue reading “Create a weather forecast model with ML”

Testing in Python

After having seen how to test in R, let’s see how to do the same in Python:

Writing a tests-oriented program

Good practice demands that we try to write our tests before we code the program we intend to write.

At the very least, we can try to write the code in a way that is easier to test in the future, fighting our natural tendency to write the tests after the code.
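As a tiny illustration of the test-first habit, here is a sketch using pytest; the module and function names (temperature, celsius_to_fahrenheit) are hypothetical, invented only for this example:

# test_temperature.py -- written first, before the implementation exists
from temperature import celsius_to_fahrenheit  # hypothetical module


def test_celsius_to_fahrenheit():
    # The test pins down the behaviour we want before any code exists.
    assert celsius_to_fahrenheit(0) == 32
    assert celsius_to_fahrenheit(100) == 212
    assert celsius_to_fahrenheit(-40) == -40


# temperature.py -- written afterwards, with the sole aim of making the test pass
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

Running pytest then tells us immediately whether the implementation satisfies the behaviour we specified up front.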

To do that, try to follow these guidelines:

Guidelines

Continue reading “Testing in Python”

Alpha parameter doesn’t work on geom_rect!!! Sort of…

The parameter alpha in the R package ggplot2 is used to set the transparency of the fill colour in the geom_ family of functions.

However, for the function geom_rect it might not work as expected.

In my latest work, I tried to combine different geom functions, but I got stuck when everything was covered over as soon as I used geom_rect.
Let’s see an example:

library("dplyr")
library("ggplot2")
df = 
  data.frame(
    x = c(rep(1, 25), rep(2, 25), rep(3, 25)),
    y = c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25)),
    classes = c(rep("A", 25), rep("B", 25), rep("C", 25)))

head(df)
#  x  y classes
#1 1 45       A
#2 1  4       A
#3 1 41       A
#4 1 32       A
#5 1  8       A
#6 1 14       A

If we plot the data using geom_jitter and geom_boxplot, we obtain the following plot:

ggplot(data = df,
       aes(x = x, y = y, colour = classes)) + 
  geom_jitter() +
  geom_boxplot(alpha = 0.2) +
  theme_minimal()
Continue reading “Alpha parameter doesn’t work on geom_rect!!! Sort of…”

Hidden Markov Model applied to biological sequence. Part 2

This is part 2, for part 1 follow this link.

Application on Biological sequences

As seen thus far, MC and HMM are powerful methods that can be used for a wide variety of purposes. However, for the study of biological sequences we use a special case of HMM called the Profile HMM. In the following section, my description of this system should explain the reasoning behind the use of the Profile HMM.

Analysis of an MSA

Let us consider a set of functionally related DNA sequences. Our objective is to characterise them as a “family”, and consequently identify other sequences that might belong to the same family [1].

We start by creating a multiple sequence alignment to highlight conserved positions:

ACAATG
TCAACTATC
ACACAGC
AGAATC
ACCGATC

It is possible to express this set of sequences as a regular expression. The family pattern for this set of sequences is:

[AT][CG][AC][ACGT]*A[TG][GC]
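As a quick sanity check, this family pattern can be written as a standard regular expression and tested against the sequences above with Python’s re module; this is just a minimal sketch:

import re

# Family pattern derived from the multiple sequence alignment
family_pattern = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

sequences = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]

for seq in sequences:
    # fullmatch() requires the whole sequence to fit the pattern
    print(seq, bool(family_pattern.fullmatch(seq)))

All five sequences of the toy family match the pattern.

Continue reading “Hidden Markov Model applied to biological sequence. Part 2”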

Hidden Markov Model applied to biological sequence. Part 1

Introduction on Markov Chains Models

The Markov Chains (MC) [1][2] and the Hidden Markov Model (HMM) [3] are powerful statistical models that can be applied in a variety of different fields, such as: protein homologies detection [4]; speech recognition [5]; language processing [6]; telecommunications [7]; and tracking animal behaviour [8][9].

HMM has been widely used in bioinformatics since its inception. It is most commonly applied to the analysis of sequences, specifically DNA sequences [10], for their classification [11] or for the detection of specific regions of a sequence, most notably the work done on CpG islands [12].

Overview

Markov Chain models can be applied to any situation in which the history of previous events is known, whether the events are directly observable or not (hidden). In this way, the probability of transition from one event to another can be measured, and the probability of future events can be computed.

Markov Chain models are discrete dynamical systems with a finite number of states, in which transitions from one state to another are based on a probabilistic model rather than a deterministic one. It follows that the information for a generic state X of the chain at time t is expressed by the transition probabilities from time t-1.
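As a minimal illustration (with made-up numbers), a two-state weather chain can be written as a transition matrix, and the state distribution at time t is obtained directly from the distribution at time t-1:

import numpy as np

# Toy transition matrix for a two-state chain (states: sunny, rainy).
# Row i gives the probabilities of moving from state i to each state.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# Distribution over states at time t-1 (here: certainly sunny).
p_prev = np.array([1.0, 0.0])

# Distribution at time t follows from the transition probabilities alone.
p_next = p_prev @ P
print(p_next)  # [0.8 0.2]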

Continue reading “Hidden Markov Model applied to biological sequence. Part 1”

Bag-of-words and k-mers

Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, where it can be successfully applied to repertoires of DNA/AA sequences.

With the BOW approach, we can redefine a highly complex document, such as a picture or a book, as a smaller set of low-level features called codewords; the full set of codewords is known as a codebook (also called a dictionary or vocabulary).

The nature and origin of the features are arbitrary: if we want to analyse a book or a series of books, we can choose as features all the words present in the books, or the letters, or combinations of letters. As for the origin, the features can be all the words present in the books themselves, all the words in the English dictionary, and so on. As a result, the length of the codebook is defined by the number of features chosen. Consider, for example, the following four sentences:

  1. Tom likes to go to the cinema on Sunday.
  2. Martin likes to go to the cinema on Tuesday.
  3. George prefers to visit science museums.
  4. Carl prefers to visit art museums.
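
Using these four sentences, a BOW representation can be sketched with scikit-learn’s CountVectorizer (just one possible choice of library): each sentence becomes a vector of word counts over the shared codebook.

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Tom likes to go to the cinema on Sunday.",
    "Martin likes to go to the cinema on Tuesday.",
    "George prefers to visit science museums.",
    "Carl prefers to visit art museums.",
]

# The codebook here is simply every word that appears in the four sentences.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

# get_feature_names_out() is the name used in recent scikit-learn versions.
print(vectorizer.get_feature_names_out())  # the codewords
print(bow.toarray())                       # one count vector per sentence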
Continue reading “Bag-of-words and k-mers”

Bayes’ theorem

Overview

Bayes’ theorem is today considered one of the main theorems in statistics, and one of the most applied formulae in science.

Its importance grew steadily until the middle of the last century, and it is now considered essential in all statistics courses and is applied in almost every field of research, not least in bioinformatics, where it has been used extensively in the analysis of biological systems.

At first glance, Bayes’ theorem can seem confusing, counterintuitive, and hard to grasp. We know that, for many, statistics is not intuitive, as with other aspects of mathematics.

However, if we analyse the thought processes leading Bayes to his theorem, we see that these are natural and logical ways of thinking.
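For reference, the theorem itself is a single line relating a conditional probability to its inverse:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

where P(A) is the prior probability, P(B \mid A) the likelihood of the evidence B, and P(A \mid B) the posterior probability.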

Continue reading “Bayes’ theorem”

K-Means in R and Python

K-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (k) within the dataset, so this has to be provided prior to the initialisation of the algorithm.

The basic idea of the K-means algorithm is that data points within the same group gather near each other on the plane; consequently, close points are likely to belong to the same cluster.

The K-means algorithm generally performs better than hierarchical clustering (HC) and executes faster, with a time complexity of O(n^{2}), lower than that of HC methods, whose complexity lies between O(n^{2}\log n) and O(n^{3}).

On the other hand, HC provides better-quality results than K-means.

In general, a K-means algorithm is good for a large dataset and HC is good for small datasets.

The algorithm

Given a set of n points, where each point is a d-dimensional vector, the algorithm separates the n points into k sets (with k \leq n), forming the clusters S = \{S_{1}, S_{2},\ \dots\ , S_{k}\} with centres \mu_{1},\ \dots\ , \mu_{k}. The within-cluster sum of squares (WCSS), which the algorithm minimises, is defined as:

WCSS = \min_{S} \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left\| x_{j}-\mu_{i}\right\| ^{2}
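
As a quick Python sketch of this objective, scikit-learn’s KMeans minimises the WCSS (reported as inertia_); the toy data below is made up purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three loose groups of points (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(25, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(25, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(25, 2)),
])

# k must be chosen in advance; here we know the data has three groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # the centres mu_1, ..., mu_k
print(kmeans.inertia_)          # the WCSS of the fitted clustering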

The algorithm is composed of four steps:

Continue reading “K-Means in R and Python”

Introduction to Support Vector Machine (2)

Follow this link for part one of this introduction.

Finding the best hyperplane

In our example, we have used two-dimensional data: length vs. width of flower petals.

Because the space is two-dimensional, the hyperplane must have one dimension fewer and is therefore a line.

The formula of a line is y = ax + b, and this is not dissimilar to the general formula of a hyperplane, which is defined as:

w^{T}x=0

The left-hand side of the equation can be considered as the inner product of two vectors. Indeed, when we are dealing with points in space, as in this case, it is useful to use the concept of vectors.

The introduction of vectors should not come as a surprise, given the name of the method. However, explaining all of the vector algebra and the mathematics behind it would be needlessly long. Here, I give the minimal amount of information needed to understand the concept:

Definition: A vector is defined as any quantity with a magnitude and a direction.

In other words, a vector exists between the origin O(0,0) and a point in space.
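As a small numeric sketch of the decision rule above (with a made-up weight vector, and the bias absorbed into w by appending a constant 1 to each point), the side of the hyperplane on which a point falls is simply the sign of the inner product:

import numpy as np

# Hypothetical weight vector defining the hyperplane w^T x = 0
# (the last component plays the role of the bias term).
w = np.array([1.0, -1.0, -2.0])

# Two-dimensional points (petal length, petal width), each extended with a 1.
points = np.array([
    [1.4, 0.2, 1.0],
    [4.7, 1.4, 1.0],
])

# The sign of w^T x tells us on which side of the hyperplane each point lies.
print(np.sign(points @ w))  # -> [-1.  1.]: the two points fall on opposite sides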

Continue reading “Introduction to Support Vector Machine (2)”

Introduction to Support Vector Machine (1)

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely across all fields of research, not least bioinformatics.

The SVM classification method generally has good classification performance, and it is flexible enough to be used with a wide range of data. Programming languages such as R and Python offer several libraries for computing and working with SVMs in a simple and flexible way.
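For example, in Python a linear SVM can be fitted in a few lines with scikit-learn; this is only a minimal sketch on made-up two-dimensional points, not the data discussed in this post:

import numpy as np
from sklearn.svm import SVC

# Made-up two-dimensional points belonging to two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [5.0, 6.0], [5.5, 6.2], [5.2, 5.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel looks for a separating line between the two classes.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)                   # the points that define the margin
print(clf.predict([[2.0, 2.0], [5.0, 5.5]]))  # -> [0 1]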

Brief history

The idea behind the Support Vector Machine (SVM) is to find a separating line between two classes of elements represented as points in space.

Continue reading “Introduction to Support Vector Machine (1)”