Create a weather forecast model with ML


How to create a simple weather forecast model using ML, and how to find publicly available weather data with ERA5!

As a data scientist at Intellegens, I work on a plethora of different projects for different industries, including materials, drug design, and chemicals. For one particular project, I was in desperate need of weather data: things like temperature, humidity, and rainfall, given the spacetime coordinates (date, time and GPS location). This sent me down a rabbit hole so deep that I decided to share it with you!

Weather Data

I thought that finding an API that could give this type of information was going to be easy. I didn’t foresee that weather data would be one of the most jealously guarded types of data.

If you search for “free weather API”, you will see plenty of similar websites offering different services that are not actually free, and even where there is a free package, it never includes historical weather records. You really need to search hard before finding the Climate Data Store (CDS) website.

What is the CDS?

The CDS is a service provided by the EU that offers all kinds of climate information from any part of the world and, most relevant for us, it hosts ERA5.

ERA5 is a reanalysis dataset that provides hourly weather data from 1979 to 2019. It is free to use, has no limitations, and the API can be used in Python. Bingo!

What you need to do is sign up on the website, copy the CDS API key, and install the client.
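As a rough sketch of that setup (the `<UID>:<API-KEY>` value is a placeholder; the real one comes from your CDS profile page, and the API URL below is the one current when this was written):

```shell
# Install the official CDS API client
pip install cdsapi

# The client reads its credentials from ~/.cdsapirc;
# replace <UID>:<API-KEY> with the values shown on your CDS profile page
cat > ~/.cdsapirc <<'EOF'
url: https://cds.climate.copernicus.eu/api/v2
key: <UID>:<API-KEY>
EOF
```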

The service is so well done that you can use this page link to select what kind of data you want, the period, and the format. Press “Show API request” and it will show the exact Python code needed to download the data. Fantastic!

This all sounds great, but some work is still to be done:

Example: Temperature in Cambridge (UK)

Let’s now see an example of how to fetch weather records. We will look at the temperature at 2 metres above the surface for Cambridge, where Intellegens is located.

The latitude/longitude for the city of Cambridge is 52.205337/0.121817, and let’s start by getting the temperature for the first week of April 2019 (Monday to Sunday, 01-07/04/2019).

First, we need to import the package for the API, cdsapi, and the one to handle the format of the file we will download, netCDF4.

import cdsapi
import netCDF4
from netCDF4 import num2date

The numpy and pandas packages to handle the data, and matplotlib and seaborn to plot them:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Let’s see the function I wrote to download the data, get_weather_data:


def get_weather_data(year, month, day, area, time,
                     file_location='weather_data.nc'):
    """
    Inputs:
        year, month, day, time: strings or lists of strings
        area: string, "north/west/south/east"
        file_location: string, path of the file to save
    Output:
        file_location: the same string as in input
    """
    c = cdsapi.Client()
    c.retrieve(
        'reanalysis-era5-single-levels', {
            'product_type': 'reanalysis',  # This is the dataset produced by the CDS
            'variable': ['2m_temperature'],
            'year': year,
            'month': month,
            'day': day,
            'area': area,
            'time': time,
            'format': 'netcdf'  # The format we choose to use
        },
        file_location)
    return file_location

With the function above we can download the data and save it in the ‘.nc’ format. We’ll see what kind of input it requires below.

For this example, let’s use the position of Cambridge in England:

latitude  = "52.205337"
longitude = "0.121817"

This API finds the weather data for a rectangular area defined by four values; however, we can target a single point by using the same latitude and longitude twice, as seen in the code below.

The weather function also requires the date, divided into year, month and day. Each needs to be a string without abbreviation, and single-digit days and months must be zero-padded or the request will cause an error.
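Rather than typing the zero-padded strings by hand, they can be generated:

```python
# Build zero-padded day strings, e.g. 1 -> '01', as the API requires
days = [f"{d:02d}" for d in range(1, 8)]
print(days)  # ['01', '02', '03', '04', '05', '06', '07']
```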

# Let's proceed with the request of the data:
file_location = get_weather_data(
    year  = "2019",

    # NB: Single-digit days or months must have 0s in front or the request will fail
    day   = ['01', '02', '03', '04', '05', '06', '07'],
    month = "04",

    # ERA5 accepts a rectangular grid as a search area,
    # but we can also input a single point like this:
    area  = latitude + '/' + longitude + '/' + latitude + '/' + longitude,

    # We can request all 24 hours of the day. The only accepted formats are listed here:
    time  = ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00',
             '06:00', '07:00', '08:00', '09:00', '10:00', '11:00',
             '12:00', '13:00', '14:00', '15:00', '16:00', '17:00',
             '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'],
    file_location = 'first_week_Apr_2019.nc')

Now we have fetched the data and downloaded the file ‘first_week_Apr_2019.nc’.

What we need now is to load it, with the following:


from datetime import datetime, timedelta

def read_netcdf_file(file_location):
    f = netCDF4.Dataset(file_location)  # This is the python package we use to open the nc file
    t2m = f.variables['t2m'][:].flatten()
    time = f.variables['time'][:].flatten()
    # Time points are expressed in hours since 1900-01-01 00:00
    start_time = datetime.strptime("01/01/1900 00:00", "%d/%m/%Y %H:%M")
    time_points = []
    for x in range(len(time)):
        hours_from_start_time = start_time + timedelta(hours=int(time.data[x]))
        time_points.append(hours_from_start_time)

    values = pd.DataFrame({
        # As we said, the values are in Kelvin; to convert to Celsius we subtract 273.15
        # (-32767 is the missing-data sentinel)
        "t2m": [x - 273.15 for x in t2m.data if x != -32767],
        "time": time_points
        })
    return values

The function above opens the “.nc” file and returns a pandas dataframe with the temperature values and time points.

In the file, the temperature is in Kelvin, and it is converted into Celsius. A modification is also made to the time points: those are expressed in hours since the 1st of January 1900 and are returned here as datetimes.
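Both conversions can be checked in isolation; the hour offset below is a made-up example value, not one read from the file:

```python
from datetime import datetime, timedelta

# Kelvin to Celsius: subtract 273.15
t2m_kelvin = 276.25
t2m_celsius = t2m_kelvin - 273.15  # about 3.1 °C

# ERA5 stores time as hours elapsed since 1900-01-01 00:00
era5_epoch = datetime(1900, 1, 1)
timestamp = era5_epoch + timedelta(hours=1045440)  # hypothetical offset
print(timestamp)  # 2019-04-07 00:00:00
```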

These are the results:

weather_data = read_netcdf_file('first_week_Apr_2019.nc')
# This is the results:
print(weather_data)
         t2m                time
0    3.098695 2019-04-01 00:00:00
1    2.889338 2019-04-01 01:00:00
2    2.439536 2019-04-01 02:00:00
3    2.258907 2019-04-01 03:00:00
4    1.965138 2019-04-01 04:00:00
..        ...                 ...
163  8.913069 2019-04-07 19:00:00
164  8.548466 2019-04-07 20:00:00
165  7.649648 2019-04-07 21:00:00
166  7.184301 2019-04-07 22:00:00
167  7.305902 2019-04-07 23:00:00

If we want to plot them, this would be the result:

plt.figure(figsize=(15,3))
sns.lineplot(data = weather_data, x= 'time', y='t2m')
The rise and decrease in temperature for the first week of April 2019

We can see, unsurprisingly, that the temperature rises during the day and then declines.

To better appreciate this phenomenon, I wanted to add the moments of sunrise, noon, and sunset, and I found this interesting API: https://sunrise-sunset.org/api

It only requires the location and dates. Therefore, let’s write the dates in an API-friendly way:

dates = [] 
for date in weather_data.time:
    dates.append("{}-{}-{}".format(date.year, date.month, date.day))
dates = np.unique(dates)
dates
# array(['2019-4-1', '2019-4-2', '2019-4-3', '2019-4-4', '2019-4-5',
#       '2019-4-6', '2019-4-7'], dtype='<U8')

Once the dates are obtained, we create a dataframe to fill with the moments of sunrise, sunset and solar noon (maximum elevation of the sun) for that week:

sun_position = pd.DataFrame(columns=["sunrises", "sunsets", "solar_noons" ])

Now, we can fetch the data:

import urllib.request
import json

for date in dates:
    api_request = "https://api.sunrise-sunset.org/json?" +\
        "lat=" + latitude +\
        "&lng=" + longitude +\
        "&date=" + date
    with urllib.request.urlopen(api_request) as response:
        res = json.loads(response.read().decode('utf-8'))
        sun_position.loc[len(sun_position)] = [
            # The API returns 12-hour times, so we use %I (hour on a 12-hour clock)
            # and %p (AM/PM) instead of %H (hour on a 24-hour clock)
            datetime.strptime(date + " " + res['results']['sunrise'], "%Y-%m-%d %I:%M:%S %p"),
            datetime.strptime(date + " " + res['results']['sunset'], "%Y-%m-%d %I:%M:%S %p"),
            datetime.strptime(date + " " + res['results']['solar_noon'], "%Y-%m-%d %I:%M:%S %p")]

This is the result:

sun_position
sunrises	sunsets	solar_noons
0	2019-04-01 05:33:11	2019-04-01 18:33:25	2019-04-01 12:03:18
1	2019-04-02 05:30:52	2019-04-02 18:35:09	2019-04-02 12:03:01
2	2019-04-03 05:28:34	2019-04-03 18:36:52	2019-04-03 12:02:43
3	2019-04-04 05:26:15	2019-04-04 18:38:36	2019-04-04 12:02:26
4	2019-04-05 05:23:58	2019-04-05 18:40:19	2019-04-05 12:02:08
5	2019-04-06 05:21:40	2019-04-06 18:42:03	2019-04-06 12:01:51
6	2019-04-07 05:19:23	2019-04-07 18:43:46	2019-04-07 12:01:34

And now, we can plot it:

plt.figure(figsize=(15,3))
sns.lineplot(data = weather_data, x= 'time', y='t2m',
             color='grey')
for sun in range(sun_position.shape[0]):
    plt.axvline(x = sun_position.iloc[sun][0], color='r', linewidth = 0.5)
    plt.axvline(x = sun_position.iloc[sun][1], color='b', linewidth = 0.5)
    plt.axvline(x = sun_position.iloc[sun][2], color='y', ls="-.")
The rise and decrease in temperature for the first week of April 2019. With sunset, sunrise and solar noon.

We can see that the coldest moment of the day is just before sunrise (the red line). The hottest moment of the day is after solar noon (the highest position of the sun in the sky), represented by the dash-dotted yellow line, and it can last for 3-4 hours. The temperature keeps falling steadily past the sunset line, in blue.

Temperature for 2019

A week was fun, but can we repeat the experiment over a whole year?

latitude  = "52.205337"
longitude = "0.121817"

file_location = get_weather_data(
    year  = "2019",
    day   = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
             '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
             '25', '26', '27', '28', '29', '30', '31'],
    month = ['01', '02', '03','04', '05', '06', '07', '08', '09','10', '11', '12'],
    area  = latitude +'/'+ longitude +'/'+ latitude +'/'+ longitude,
    time  = ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00',
             '06:00', '07:00', '08:00', '09:00', '10:00', '11:00',
             '12:00', '13:00', '14:00', '15:00', '16:00', '17:00',
             '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'],
    file_location = 'weather_for_2019.nc'
)

If we plot them:

values_for_2019 = read_netcdf_file('weather_for_2019.nc')

plt.figure(figsize=(15,3))
sns.lineplot(data = values_for_2019, x= 'time', y='t2m')
The temperature for Cambridge in 2019

From the plot, we can see that the temperature rises until August and then declines (truly unremarkable, for a city in the northern hemisphere). 

There are three temperature peaks: one in April, the hottest day in August, and one last peak just before September.

Temperature since 1979

The temperature behaviour for one year was interesting, but the database goes back to 1979. What would that look like?

For this type of research, it is better not to download every single day, month and year. But if you really are so inclined, I would suggest downloading one year at a time.
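A minimal sketch of the one-year-at-a-time approach, reusing get_weather_data from above (the filename pattern is my own choice):

```python
# One request (and one output file) per year keeps each download manageable
years = [str(y) for y in range(1979, 2020)]
requests = [(year, f"weather_{year}.nc") for year in years]
print(requests[0], requests[-1])  # ('1979', 'weather_1979.nc') ('2019', 'weather_2019.nc')

# Each pair would then be passed to the download function, e.g.:
# for year, file_location in requests:
#     get_weather_data(year=year, month=..., day=..., area=..., time=...,
#                      file_location=file_location)
```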

latitude  = "52.205337"
longitude = "0.121817"

file_location = get_weather_data(
    year  = ['1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', 
             '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
             '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
             '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018',
             '2019'],
    day   = ['01', '06', '11', '16', '21', '26'],
    month = ['01', '04', '07', '10'],
    time  = ['03:00', '09:00', '15:00', '21:00'],
    
    area  = latitude +'/'+ longitude +'/'+ latitude +'/'+ longitude,

    file_location = 'weather_from_1979.nc'
)

And the result is:

weather_from_1979 = read_netcdf_file('weather_from_1979.nc')
plt.figure(figsize=(15,3))
sns.lineplot(data = weather_from_1979, x = 'time', y = 't2m')

I think it is nice to see how the temperature oscillates. 

It doesn’t seem to change over the years, but is that true? We can try a linear regression and see for ourselves.

Let’s do some simple prediction

Because datetimes don’t work well with linear regression, we convert all the dates into days elapsed since the first time point.

points = []
start = datetime.strptime(str(weather_from_1979.time[0]), "%Y-%m-%d %H:%M:%S")
for x in range(len(weather_from_1979.time)):
    b = datetime.strptime(str(weather_from_1979.time[x]), "%Y-%m-%d %H:%M:%S")
    c = b - start
    points.append(c.days + c.seconds/3600/24)
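For what it’s worth, the same conversion can be done in one vectorized step with pandas; the sample timestamps below are hypothetical stand-ins for weather_from_1979.time:

```python
import pandas as pd

times = pd.to_datetime(["1979-01-01 03:00:00",
                        "1979-01-01 09:00:00",
                        "1979-01-02 03:00:00"])
# Fractional days elapsed since the first time point
elapsed = (times - times[0]) / pd.Timedelta(days=1)
print(list(elapsed))  # [0.0, 0.25, 1.0]
```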

We then apply the regression.

from sklearn.linear_model import LinearRegression
X = np.array(points).reshape(-1, 1)
Y = np.array(weather_from_1979.t2m).reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)  
Y_pred = linear_regressor.predict(X) 
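Since the fitted slope is in degrees per day, it can also be scaled to degrees per decade to read the trend directly. A toy example with an exactly linear series (the 0.0002 °C/day slope is made up, not the fitted value):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.arange(0, 365 * 40, 30).reshape(-1, 1)  # ~40 years sampled every 30 days
temp = 9.0 + 0.0002 * days                        # synthetic linear warming
model = LinearRegression().fit(days, temp)
per_decade = model.coef_[0][0] * 365.25 * 10      # °C per decade
print(round(per_decade, 4))  # 0.7305
```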

And plot it.

plt.figure(figsize=(17,5))
plt.plot(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
All temperature in Cambridge since 1979.

It is hard to see, but the red line is going a bit upward.

If we look at the predictions, the temperature for 2019 is 2 degrees higher than it was in 1979!

print(Y_pred)

array([[ 9.29849547],
       [ 9.29853431],
       [ 9.29857315],
       ...,
       [11.61472465],
       [11.61476349],
       [11.61480233]])
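The 2-degree claim can be read directly off the endpoints of the fitted line printed above:

```python
# First and last predicted values from the output above
first, last = 9.29849547, 11.61480233
print(round(last - first, 2))  # 2.32 degrees over the 40-year span
```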

Indeed, the average temperatures for 1979 and 2019 confirm

print(weather_from_1979["t2m"][weather_from_1979.time <= "1980-01-01"].mean())
print(weather_from_1979["t2m"][weather_from_1979.time >= "2019-01-01"].mean())
# 8.441398057975517

that the temperature is rising …

Predict future temperature with Random Forest Regression

Let’s step up the game and create a more complex machine learning model to predict future temperature.

The tool I chose to use is the Random Forest Regressor from the popular package sklearn.

Let’s import everything we need:

from sklearn.ensemble import RandomForestRegressor # This is the main tool we are going to use
from sklearn.model_selection import train_test_split # With this we will split the dataset in a training and testing set.
from sklearn.metrics import r2_score # With this metric we will measure how good our model is. 

X_train, X_val, y_train, y_val = train_test_split(
    X , Y
    , test_size = 0.2
    , random_state = 42
)

forecast_model  = RandomForestRegressor(n_estimators=500)

forecast_model.fit(X_train, y_train)
val_predictions = forecast_model.predict(X_val)
print("R2 : {}".format(r2_score(np.array(y_val), val_predictions )))

# R2 : 0.7351425319335796

The model has an R-squared of about 0.74; not terrible, and good enough for a quick model!

If we now use the entire dataset as the training set:

forecast_model = RandomForestRegressor(n_estimators=500)
forecast_model.fit(X, Y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=500,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

And let’s see how the model predicts the temperature right now. But first we have to convert the time, repeating what we did above:

now = datetime.now()
from_1979 = datetime.strptime(str(weather_from_1979.time[0]), "%Y-%m-%d %H:%M:%S")

difference = now - from_1979
difference = np.array(difference.days + difference.seconds/3600/24)

forecast_model.predict(difference.reshape(-1, 1))

# array([6.23150068])

The model predicted a temperature of 6 degrees, and the actual temperature is 7 degrees; not bad!

Conclusions:

In this article, we have seen that we can download historical weather data and, with simple machine learning tools, create our own forecast system that is not terribly bad!
