Create a weather forecast model with ML

How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of projects for different industries, including materials, drug design, and chemicals. For one particular project, I was in desperate need of weather data: things like temperature, humidity, and rainfall for given space-time coordinates (date, time and GPS location). This made me fall into a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t expect weather data to be one of the most jealously guarded types of data.

If you search for “free weather API”, you will find plenty of similar websites offering different services, but none that is actually free; even when there is a free package, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.


Testing in Python

After having seen how to test in R, let’s see how to do the same in Python:

Writing a test-oriented program

Good practice demands that we write our tests before we code the program we intend to build.

At the very least, we can write the code in a way that makes it easier to test later, fighting our natural tendency to write the tests after the code.
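
As a minimal sketch of the test-first idea, assuming the standard unittest module and a hypothetical function called mean (it is not part of the original post), the test class below describes the expected behaviour and the function is then written to satisfy it:

```python
import unittest


def mean(values):
    """Hypothetical function under test: arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    # In a test-first workflow these tests are written before mean() exists.
    def test_simple_values(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])
```

Saved in a file, the tests can be run with python -m unittest followed by the file name.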

To do that try to follow these guidelines:



K-Means in R and Python

K-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (k) within the dataset; this therefore has to be provided prior to the initialisation of the algorithm.

The basic idea of the K-means algorithm is that data points within the same group will gather near each other on the plane. Consequently, close points are likely to belong to the same cluster.

The K-means algorithm generally runs faster than a hierarchical clustering (HC) algorithm: its time complexity is O(n^{2}), lower than that of HC methods, which ranges between O(n^{2}\log n) and O(n^{3}).

On the other hand, HC tends to provide better-quality results than K-means.

In general, a K-means algorithm is good for a large dataset and HC is good for small datasets.

The algorithm

Given a set of n points, where each point is a d-dimensional vector, the algorithm separates the n points into k sets (with k \leq n), forming the clusters S = \{S_{1}, S_{2}, \ldots, S_{k}\} with centers \left( \mu_{1}, \ldots, \mu_{k} \right). The algorithm minimises the Within-Cluster Sum of Squares (WCSS), defined as:

WCSS = \min \sum_{i=1}^{k} \sum_{x_{j} \in S_{i}} \left\| x_{j} - \mu_{i} \right\|^{2}
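
To make the formula concrete, here is a minimal pure-Python sketch that evaluates the double sum for one fixed partition (the minimisation over partitions is what K-means itself searches for). The function name wcss and the toy points are made up for illustration:

```python
def wcss(clusters):
    """Within-Cluster Sum of Squares for a list of clusters.

    Each cluster is a non-empty list of points; each point is a tuple of floats.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centre mu_i: component-wise mean of the points in cluster S_i
        centre = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Add the squared Euclidean distance of every point to its centre
        for p in points:
            total += sum((p[d] - centre[d]) ** 2 for d in range(dim))
    return total


# Toy data (made up): two well-separated groups on the plane
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],    # centre (0, 1): contributes 1 + 1
    [(10.0, 0.0), (14.0, 0.0)],  # centre (12, 0): contributes 4 + 4
]
print(wcss(clusters))  # 10.0
```

Grouping the same four points badly, for example pairing a point from each group, gives a much larger value, which is why a lower WCSS indicates tighter clusters.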

The algorithm is composed of four steps:


How to do a simple SVM classification in R and Python

Support Vector Machine (SVM) is a supervised learning model used for data classification and regression analysis.

It is one of the main machine learning methods used in modern-day artificial intelligence, and it has spread widely in all fields of research, not least, in bioinformatics.

The SVM classification method has, in general, a good classification efficiency, and it is flexible enough to be used with a great range of data.

Languages, like R or Python, offer several libraries to compute and work with SVMs in a simple and flexible way.

Let’s see how to classify a dataset in R and Python using some basic code.

For this example we will use the Iris dataset.


Simple linear regression in Python

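Let’s see a simple way to compute a linear regression using Python.

import matplotlib.pyplot as plt  # To plot the graph
import pandas as pd
from sklearn import datasets as db  # Assumed: `db` was not imported in the original listing

# Load a dataset to work with; in this case I chose the famous Iris dataset
iris_data = db.load_iris()
# The original line was truncated; the column names are assumed to come from 'feature_names'
iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])

Let’s take two columns from the dataset and plot them: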

length=iris['petal length (cm)']
width=iris['petal width (cm)']

plt.scatter(length, width, c=list(iris.index))

Figure: Iris dataset, petal length vs. petal width

Now, to compute the linear regression we need the scipy library:

from scipy import stats

# Here we compute the linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(length, width)

Not surprisingly, our R-squared value shows a really good fit:

r_value ** 2

# 0.9271098389904932

Let’s use the slope and intercept we got from the regression to plot predicted values vs. observed:

def predict(x):
    return slope * x + intercept

fitLine = predict(length)

plt.scatter(length, width)
plt.plot(length, fitLine, c='red')
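
For intuition, the slope, intercept and r value of a simple linear regression can also be computed by hand with the ordinary least-squares formulas. The sketch below is pure Python; the function name least_squares and the toy data are made up for illustration, and it is not meant to mirror scipy's internal implementation:

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r_value) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_value = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r_value


# Toy data (made up) lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept, r_value = least_squares(xs, ys)
print(slope, intercept, r_value ** 2)
```

On data that lies exactly on a line, the R-squared value is 1; on real data such as the Iris columns it is lower, as seen above.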

Tutorial on Luigi, part 3 pipeline: input() and output()

In the last article we saw some small examples of a Luigi pipeline. In this article I want to explore how to make the different Tasks communicate and pass information between them, in particular through LocalTarget.

We already saw that we can use parameters to pass information from one Task to the next; another nice way is to use the methods input() and output().

The use of self.input()

Let’s see an example:

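import os
import luigi
import matplotlib.pyplot as plt

# NOTE: several call arguments were truncated in the original listing.
# They are reconstructed here from the surrounding text and are assumptions.

class PassPlotNameTask(luigi.Task):
    name      = luigi.Parameter(default="simple_plot.png")
    directory = luigi.Parameter(default="{}/{}".format(os.getcwd(), 'folder'))

    def requires(self):
        # Assumed: both parameters are passed down to the plotting Task
        return CreatePlotTask(name=self.name, directory=self.directory)

    def output(self):
        # Assumed: points at the file produced by CreatePlotTask
        return luigi.LocalTarget(self.input().path)

class CreatePlotTask(luigi.Task):
    name      = luigi.Parameter()
    directory = luigi.Parameter()

    def run(self):
        x = range(1, 10, 1)
        y = [i ** 2 for i in x]

        fig = plt.figure()
        ax = plt.subplot(111)

        ax.plot(x, y)
        # Here we replace os.getcwd() with self.input().path
        return fig.savefig("{}/{}".format(self.input().path, self.name))

    def output(self):
        # Assumed: the same file that run() saves
        return luigi.LocalTarget("{}/{}".format(self.directory, self.name))

    def requires(self):
        return MakeDirectory(directory=self.directory)

class MakeDirectory(luigi.Task):
    directory = luigi.Parameter()

    def output(self):
        # Assumed: the directory itself is the target
        return luigi.LocalTarget(self.directory)

    def run(self):
        # Assumed: the body was missing in the original listing
        os.makedirs(self.directory)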

The value of self.input() comes from the result of the output() method of the Task returned by requires(); in this case, that Task is MakeDirectory.


Tutorial on Luigi pipeline, part 2: Examples

After the introduction in the previous post, let’s now see an example that I coded to better teach myself how to use Luigi’s pipelines.

A Task in Luigi

Here follows a simple Luigi Task:

# Let's import what we need:
import os
import luigi
import matplotlib.pyplot as plt

# The Task:
class CreatePlotTask(luigi.Task):
    # A parameter plays the role of a constructor argument for the Task.
    # We can think of it as declaring a 'variable' for our script.
    # I believe it is good practice to list the parameters before their use.
    # However, in this case it is not necessary.
    name = luigi.Parameter(default="simple_plot.png")

    def run(self):
        x = range(1, 10, 1)
        y = [i ** 2 for i in x]

        fig = plt.figure()
        ax = plt.subplot(111)
        ax.plot(x, y)

        # The end of this line was truncated in the original; self.name is assumed
        return fig.savefig("{}/{}".format(os.getcwd(), self.name))

    def output(self):
        # Assumed: the target is the same file that run() saves
        return luigi.LocalTarget("{}/{}".format(os.getcwd(), self.name))

Tutorial on Luigi pipeline, part 1: Introduction

From the Luigi documentation page, I can summarise:

Luigi is a pipeline library written entirely in Python by Spotify to solve the pipeline problems associated with long-running batch processes.

The structure of a pipeline in Luigi resembles that of a graph, with nodes and edges connecting the nodes.

The “nodes” are called Tasks, and the method requires() provides the connections among them.

In a simple pipeline, I would expect the tasks to be executed one after the other until the end, e.g.:

Start -> Task A -> Task B -> Task C -> End.
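
The graph idea can be illustrated without Luigi at all. The sketch below is plain Python, not Luigi's API: the Task class and the build function are hypothetical stand-ins that only show how a requires() method defines the edges and therefore the execution order:

```python
class Task:
    """Hypothetical stand-in for a pipeline task (not Luigi's real class)."""

    def __init__(self, name, requires=()):
        self.name = name
        self._requires = list(requires)
        self.done = False

    def requires(self):
        # The edges of the graph: the tasks that must run before this one
        return self._requires


def build(task, log):
    """Run every dependency first, then the task itself (depth-first)."""
    if task.done:
        return
    for dependency in task.requires():
        build(dependency, log)
    log.append(task.name)  # "running" a task just records its name here
    task.done = True


task_a = Task("Task A")
task_b = Task("Task B", requires=[task_a])
task_c = Task("Task C", requires=[task_b])

log = []
build(task_c, log)
print(" -> ".join(log))  # Task A -> Task B -> Task C
```

Asking for the last task is enough: the dependencies are discovered by following requires() backwards, which is the same direction in which a Luigi pipeline is declared.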


How to use yield in Python

Notes on the yield statement in Python

From the Python documentation we can read that:

  1. What it is: The yield statement is used when defining a generator within the body of a generator function. Thus, if you use a yield statement in a function, this creates a generator function instead of a normal function.
  2. What it does: When a yield statement is executed, the state of the generator is frozen and the value of expression_list is returned to next()’s caller.
  3. How to use it: When a generator function is called, it returns an iterator known as a generator iterator, or simply, a generator. The body of the generator function is executed by calling the generator’s next() method repeatedly until it raises an exception.
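
The three points above can be seen in a minimal sketch. The generator function countdown is a made-up example, and the built-in next() function is used here as it is spelled in Python 3:

```python
def countdown(start):
    """Generator function: yields start, start - 1, ..., 1."""
    while start > 0:
        yield start  # state is frozen here until the next value is requested
        start -= 1


gen = countdown(3)  # calling the function returns a generator, no code runs yet
print(next(gen))    # 3
print(next(gen))    # 2
print(list(gen))    # [1]  (the remaining values)
```

Once the body finishes, a further call to next(gen) raises StopIteration, which is how for loops and list() know when to stop.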

Labelled plot in ggplot2

ggplot2 is an amazing tool for plotting in R. One of the latest features I discovered is the possibility of labelling the plot automatically. Let’s check it out:

data <- data.frame(class=c('A','B','C'), 
                   value=c(50, 30, 20))

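g <- ggplot2::ggplot(data = data,
                ggplot2::aes(x = class,
                             y = value)
                ) +
  ggplot2::geom_bar(stat = "identity") +
  # Assumed reconstruction: the labelling layer was truncated in the original.
  # geom_text() with label = value is one way to label each bar.
  ggplot2::geom_text(ggplot2::aes(label = value),
                     vjust = -0.5,
                     size = 5)

ggplot2::ggsave('labelled_plot.pdf', plot=g, device = 'pdf')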

This would be the result:

Very easy and fast!