If you have spent more than five seconds on r/dataisbeautiful, you have probably come across a Sankey plot. People use them to track their expenses, job searches and all sorts of multi-step processes. Indeed, they are well suited to visualizing how events progress and where they end up.
And in my opinion, they look great!
So, let’s see how to make one in Python:
Jupyter Notebook here
Personally, I think they look awful in matplotlib.
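For reference, here is a minimal sketch of a matplotlib Sankey built with the `matplotlib.sankey` module. The flows and labels are made-up illustrative values, not the data behind any plot in this post.

```python
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

# Illustrative values only: positive flows enter the diagram, negative flows leave it.
# They should sum to zero so that nothing is lost along the way.
flows = [100, -60, -40]
labels = ["Input", "Outcome 1", "Outcome 2"]
# 0 keeps a flow horizontal, 1 routes it to the top, -1 to the bottom.
orientations = [0, 1, -1]

fig, ax = plt.subplots()
ax.axis("off")
# scale is set to 1 / largest flow so the arrows keep sensible proportions.
sankey = Sankey(ax=ax, scale=0.01, unit="")
sankey.add(flows=flows, labels=labels, orientations=orientations)
diagrams = sankey.finish()
```

In a notebook the figure is drawn at the end of the cell; in a plain script you would add `plt.show()`.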
The above plot is probably closer to the original concept of a Sankey plot (invented in 1898), but it is not something I would use in a publication.
The other solution is to use the library Plotly.
Therefore, without further ado:
Plotly is an interactive visualization library. It is well suited to Jupyter and lets you not only see the plot but also interact with it and inspect every detail.
Plotly is developed by a private company; however, the software is open source and free to use. The company makes money by hosting your dashboards and offering other visualization and machine learning services.
A bit oddly, the library has an online mode that requires internet access and an offline mode, and the code for the two differs slightly when it comes to saving and displaying plots.
That can be frustrating if you need to run your code in different settings and environments. It is, therefore, good practice to create plots that do not require an internet connection.
In the following snippet, let’s import the libraries we need for our exercise.
As you can see, we are using the offline version of the library and also importing the notebook mode that connects it to the notebook.
To be fully integrated with the notebook, we need one more step:
Now let’s see an example of a multi-step process:
Let’s imagine a 3-step process with 7 events (A1, A2, B1, B2, B3, C1, C2) and a value of 100 moving through the process:
How many go from:
A1 to B1: 10
A2 to B2: 5
We will refer to the events as nodes of the plot and the movements from node to node as links.
To create our plot, we first need to define the names (labels) of our nodes and assign a number to each node (we will see why later):
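A minimal sketch of this step. The variable names `labels` and `label_to_index` are my own choice for illustration:

```python
# Node labels, in the order in which they will be numbered.
labels = ["A1", "A2", "B1", "B2", "B3", "C1", "C2"]

# Give each label its position in the list: A1 is 0, A2 is 1, and so on.
label_to_index = {label: index for index, label in enumerate(labels)}

print(label_to_index)
```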
Now that we have the nodes, we need the links.
To work, Plotly needs the name of the starting node, the target node and the value that goes from the first to the second.
I chose to hardcode them this way (I hope you will not need to do so in your real-life project). You can also create a pandas DataFrame with the same information.
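One possible way to hardcode the links is a list of (source, target, value) triples. Only A1 to B1 (10) and A2 to B2 (5) come from the example above; the remaining values are invented for illustration so that a total of 100 flows through each step.

```python
# (source, target, value) triples.
# Only the A1 -> B1 and A2 -> B2 values come from the text;
# the others are hypothetical, chosen so that each step carries 100.
links = [
    ("A1", "B1", 10), ("A1", "B2", 50),
    ("A2", "B2", 5), ("A2", "B3", 35),
    ("B1", "C1", 10), ("B2", "C1", 30),
    ("B2", "C2", 25), ("B3", "C2", 35),
]

# Split the triples into the three parallel lists that the plot needs.
sources, targets, values = (list(column) for column in zip(*links))

# Sanity check: everything leaving the A nodes should add up to 100.
total_from_a = sum(value for source, _, value in links if source.startswith("A"))
print(total_from_a)
```

The same triples could be loaded into a pandas DataFrame with three columns if you prefer to filter or aggregate them first.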
With this, it should be clear what kind of movement there is from one node to another.
Unfortunately, Plotly’s Sankey will not accept the labels as strings for the links; it requires integer indices instead.
Using the dictionary we created before, we can convert the labels into the corresponding values.
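A short sketch of the conversion, using a few of the links as an example (the variable names are hypothetical):

```python
labels = ["A1", "A2", "B1", "B2", "B3", "C1", "C2"]
label_to_index = {label: index for index, label in enumerate(labels)}

# Links expressed with labels, as we hardcoded them.
sources = ["A1", "A2", "B1"]
targets = ["B1", "B2", "C1"]

# Look up each label in the dictionary to get the integer the plot expects.
source_indices = [label_to_index[label] for label in sources]
target_indices = [label_to_index[label] for label in targets]

print(source_indices, target_indices)
```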
Now, we can produce our first Sankey plot:
We can also modify our plot: we saw earlier that we can choose the labels for the nodes, but we can also change their colours.
I found seven colours I liked and used them (I used hex colour codes; we’ll see why I prefer these later).
The links between the nodes are not terribly nice, and we can modify them.
However, it is not as straightforward as it was for the nodes.
First, we need to decide on the colour; I chose to use the same colour as the target node, but more faded.
Second, the hex code we used before carries no transparency, so the faded colour has to be written as an RGBA string in a particular format: “rgba(x,y,z, num)”.
To do this, I assign each link the colour of its target node and convert it into RGBA.
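The conversion can be sketched in plain Python. The helper `hex_to_rgba`, the placeholder palette and the 0.4 transparency are assumptions of mine for illustration:

```python
labels = ["A1", "A2", "B1", "B2", "B3", "C1", "C2"]
node_colours = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
                "#9467bd", "#8c564b", "#e377c2"]

# Target node of each link, as labels.
targets = ["B1", "B2", "B2", "B3", "C1", "C1", "C2", "C2"]


def hex_to_rgba(hex_colour, alpha=0.4):
    """Turn '#rrggbb' into an 'rgba(r,g,b,alpha)' string."""
    hex_colour = hex_colour.lstrip("#")
    # Read the red, green and blue pairs as base-16 numbers.
    red, green, blue = (int(hex_colour[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red},{green},{blue},{alpha})"


# Each link takes the faded colour of the node it points to.
colour_of = dict(zip(labels, node_colours))
link_colours = [hex_to_rgba(colour_of[target]) for target in targets]

print(link_colours[0])
```

The resulting list is then passed as the `color` entry of the link dictionary, alongside `source`, `target` and `value`.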
This is a bit of work, but the result looks great!