General Project Overview
In this Kaggle competition, a set of 5,863 chest X-ray images (anterior-posterior) was selected from retrospective cohorts of paediatric patients, aged one to five years, in Guangzhou. All chest X-ray imaging was performed as part of the patients’ routine clinical care.
All chest radiographs were screened by two expert physicians for quality control, and all low-quality or unreadable scans were removed.
The picture below shows the three types of chest X-ray present in the database:
The left-hand image shows a healthy individual with clear lungs and no areas of abnormal opacification. The middle and right-hand images show patients affected by bacterial and viral pneumonia respectively. Viral pneumonia presents a more diffuse “interstitial” pattern in both lungs, while bacterial pneumonia typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows).
For this particular challenge we are asked only to distinguish healthy chest X-rays from those affected by pneumonia.
Pneumonia is a form of acute respiratory infection that affects the lungs. When an individual has pneumonia, the lungs’ alveoli are filled with pus and fluid, which makes breathing painful and limits oxygen intake.
Pneumonia is the single largest infectious cause of death in children worldwide. Pneumonia killed 808 694 children under the age of 5 in 2017, accounting for 15% of all deaths of children under five years old. [Pneumonia – WHO]
More recently, the Covid-19 pandemic has brought a dramatic increase in pneumonia cases all over the world, and pneumonia is considered one of the most serious consequences of this new virus. [George et al. (2020)] [Pneumonia – NHS]
This is a manually curated dataset of 5,863 X-ray images in JPEG format.
The images are organized into 3 folders (train, test, val). Each folder contains two subfolders, one per image category:
PNEUMONIA, the individuals affected by pneumonia, and
NORMAL, the healthy patients.
The objective of this Kaggle challenge is a binary classification of images. Since this is a computer vision problem, I have chosen a Convolutional Neural Network (CNN) as the deep learning model for this challenge.
Exploratory Data Analysis
Before any attempt at creating an ML/DL model, it is good practice to start with an Exploratory Data Analysis (EDA) of the database. During this stage, we analyse the data set to summarise its main characteristics and gain insight into its features using numerical analysis and visual methods.
In my experience in the healthcare and bioinformatics sectors, databases are usually particularly messy and noisy. Experimental and clinical data might come from many different sources (public or company databases) and from different laboratories following different protocols. It is imperative to be very careful with the data, to check that the data provided are correct in the input phase, and to have multiple checks along the pipeline.
For this particular challenge, we are close to an ideal scenario. We have a manually curated dataset of high-quality black-and-white pictures that does not require data cleaning, analysis of missing values, feature scaling, correlation analysis, etc. I will nevertheless proceed with an analysis of the data folder and the images contained in it.
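A minimal sketch of how the folder analysis could be done, using only the standard library. The root path `chest_xray`, the helper names `count_images` and `summarise`, and the accepted file suffixes are assumptions for illustration, not taken from the original notebook.

```python
from pathlib import Path

# Assumed layout: <root>/<split>/<class>/<image files>
SPLITS = ("train", "test", "val")
CLASSES = ("NORMAL", "PNEUMONIA")
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(root):
    """Return {split: {class_name: n_images}} for the folder layout above."""
    root = Path(root)
    counts = {}
    for split in SPLITS:
        counts[split] = {}
        for cls in CLASSES:
            folder = root / split / cls
            # A missing folder simply counts as zero images.
            files = folder.iterdir() if folder.is_dir() else []
            counts[split][cls] = sum(
                1 for f in files if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def summarise(counts):
    """Print each folder's size, its share of the dataset and its class counts."""
    total = sum(sum(per_class.values()) for per_class in counts.values())
    for split, per_class in counts.items():
        n = sum(per_class.values())
        share = 100 * n / total if total else 0.0
        print(f"{split}: {n} images ({share:.2f}% of total), {per_class}")


if __name__ == "__main__":
    # "chest_xray" is a hypothetical path; point it at the unpacked dataset.
    summarise(count_images("chest_xray"))
```

Counting per split and per class in one pass gives both numbers needed to judge the folder composition: how the images are distributed across train, test and val, and how balanced the two classes are within each folder.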
From the output above, we can see that the folder Train holds almost 90% of all images (89.07%), with about 3/4 of them labelled as Pneumonia. The folder Test holds only about 10% of all images, with about 2/3 labelled as Pneumonia. The folder Val contains only 16 images in total.
In a typical machine learning analysis, it is good practice not to use the entire database to train the model, but to split it into subsets, training on one of them and keeping the remaining ones for testing and validation.
In general, for small databases I prefer k-fold cross-validation, in which, iteratively, the whole dataset is used as part of both the train and test sets. For very large datasets, or those requiring long computation times, I prefer to split the database into training, testing and validation sets.
For this exercise I chose a train/test/validation split, because at the moment I do not have enough computing power to perform cross-validation in a timely manner. In addition, given the composition of the folders, I decided to rearrange the data and split it in a different ratio.
In the top row of the images above, we can see three examples of NORMAL images, representing healthy individuals with clear lungs and no opacity. The bottom row shows three PNEUMONIA individuals with infected, more opaque lungs.
We can also notice that all images now have the same length/width ratio.
First, I create a pandas dataframe that will contain the path to each of the images and the label associated with them.
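One way such a table could be built is sketched below. It assumes the `<root>/<split>/<class>/` folder layout described earlier and pools the images from all three folders so that they can be re-split afterwards. The function name `build_image_table` and the column names `path` and `label` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def build_image_table(root):
    """Collect every image under <root>/<split>/<class>/ into one DataFrame.

    The label is read from the name of the folder that holds the image
    (NORMAL or PNEUMONIA); the original split folder is ignored.
    """
    rows = []
    for path in sorted(Path(root).glob("*/*/*")):
        if path.suffix.lower() in {".jpeg", ".jpg"}:
            rows.append({"path": str(path), "label": path.parent.name})
    return pd.DataFrame(rows, columns=["path", "label"])
```

Calling `build_image_table("chest_xray")["label"].value_counts()` (with a hypothetical root path) then gives the overall class counts in a single line.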
I then continue by dividing the dataset again, this time in a fixed ratio.
I will use 60% of the data as the training set and 20% each for the validation and test sets.
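The 60/20/20 split can be sketched with plain pandas as follows. The split is done per class so that every subset keeps the same Pneumonia/Normal ratio. The function name `stratified_split`, the `label` column and the fixed seed are assumptions; the original notebook may well have used a library helper instead.

```python
import pandas as pd


def stratified_split(df, label_col="label", fractions=(0.6, 0.2, 0.2), seed=42):
    """Split df into (train, val, test), preserving the class ratio in each part."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby(label_col):
        # Shuffle each class on its own, then cut it at the 60% and 80% marks.
        shuffled = group.sample(frac=1.0, random_state=seed)
        n_train = int(round(fractions[0] * len(shuffled)))
        n_val = int(round(fractions[1] * len(shuffled)))
        train_parts.append(shuffled.iloc[:n_train])
        val_parts.append(shuffled.iloc[n_train:n_train + n_val])
        test_parts.append(shuffled.iloc[n_train + n_val:])

    def combine(parts):
        # Re-shuffle so that the classes are interleaved within each subset.
        merged = pd.concat(parts)
        return merged.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    return combine(train_parts), combine(val_parts), combine(test_parts)
```

Splitting class by class, rather than sampling from the whole table at once, is what guarantees the equal class ratio across the three subsets.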
Correct for data imbalance
In the rearranged sets, the ratio of Pneumonia to Normal images is the same in every subset. However, the dataset is still imbalanced, because each set contains many more Pneumonia images than Normal ones.
This is a potential problem: with the current ratio, a model could reach 73% accuracy just by classifying everything as Pneumonia. To address this, I compute the class weights to indicate the imbalance in our data to our model using the parameter
The NORMAL images are underrepresented, so they are weighted more heavily.
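A minimal sketch of one common weighting convention: each class receives the weight n_samples / (n_classes * n_samples_in_class), so the rarer class gets the larger weight. The helper name `balanced_class_weights` and the label encoding (0 = NORMAL, 1 = PNEUMONIA) are assumptions. In Keras, a dictionary of this shape can be passed to `Model.fit` through its `class_weight` argument, which is presumably the parameter meant above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight}, with weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy example with roughly the imbalance discussed above:
# 73 pneumonia images (label 1) and 27 normal images (label 0).
weights = balanced_class_weights([1] * 73 + [0] * 27)
print(weights)
```

With these weights, the total weighted contribution of each class to the loss is the same, so a model can no longer score well simply by predicting the majority class.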
Creating the CNN Model
Selecting the metrics
Fitting the model
The numerical values of AUC and binary accuracy, together with the ROC curve plot, show that the image classification task was performed with a high level of success, even with a low number of convolutional layers and epochs.
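For reference, the two metrics mentioned above can be computed from predicted probabilities as in the following sketch. This is a plain-Python illustration of the definitions, not the Keras implementation used for the reported results; the function names and the 0.5 threshold are assumptions.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the true label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


def roc_auc(y_true, y_prob):
    """Area under the ROC curve, computed through its rank interpretation:
    the probability that a random positive sample scores higher than a
    random negative one, with ties counted as one half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("roc_auc needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: labels 1 = PNEUMONIA, 0 = NORMAL (assumed encoding).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(binary_accuracy(y_true, y_prob))  # 0.75
print(roc_auc(y_true, y_prob))          # 0.75
```

Unlike accuracy, the AUC does not depend on a decision threshold, which makes it the more informative of the two on an imbalanced dataset such as this one.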
- Deep Learning with TensorFlow 2 and Keras, Second Edition. Gulli, Kapoor, Pal.
- Keras code examples
- Keras documentation
- Kaggle’s discussion boards and notebooks