Wikipedia defines machine learning as the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. Machine learning focuses on the development of computer systems that can access data and use it to learn for themselves. Using algorithms that learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. This article focuses on diabetes prediction using machine learning.

There are three main types of machine learning: supervised learning, unsupervised learning and reinforcement learning. As a subset of Artificial Intelligence (AI), machine learning can be used to solve a myriad of problems such as fraud detection, web search ranking, credit scoring, customer segmentation and email spam filtering. The use of AI is currently on the rise, and this is not going to stop anytime soon.

Key Motivation for the Project

I decided to grow my machine learning skills by engaging in diabetes prediction. I did this not only for fun and to learn but also to appreciate the essence of machine learning in solving some of the problems that plague humanity. This is, therefore, an interesting project. So, let us do this!

About the Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It is provided courtesy of the Pima Indians Diabetes Database and is available on Kaggle. Here is the link to the dataset. It consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. The dataset has 9 columns, as shown below:

  • Pregnancies               – Number of times pregnant
  • Glucose                     – Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure            – Diastolic blood pressure (mm Hg)
  • SkinThickness           – Triceps skinfold thickness (mm)
  • Insulin                        – 2-Hour serum insulin (mu U/ml)
  • BMI                            – Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction – Diabetes pedigree function
  • Age                            – Age (years)
  • Outcome                    – Class variable (0 or 1); 268 of the 768 records are 1, the others are 0

Problem Statement

This is a supervised machine learning classification problem. The objective is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

0 – Absence of Diabetes

1 – Presence of Diabetes

Dive In…

In this project, we will use Python 3 and Jupyter Notebook. Feel free to use your preferred IDE. We will go through the project by importing the dataset, conducting exploratory data analysis to understand what the dataset looks like, and then building the model. We will use Decision Trees, Random Forests, Support Vector Machines and XGBoost.

Yufeng Guo, in his article The 7 Steps of Machine Learning, provides the following general framework of steps in supervised machine learning:

  • Data Collection
  • Data Preparation
  • Choosing a model
  • Training the model
  • Evaluating the model
  • Parameter tuning
  • Making predictions

So let us begin by importing the required libraries. We will import data analysis libraries (pandas, numpy) and visualization libraries (matplotlib, seaborn). In addition, we will import warnings so that any warnings raised are hidden from the notebook. (This is optional.)

If you are using Jupyter Notebook, it is important to set %matplotlib inline so that the visualizations are shown in the notebook.

Import the dataset and do a few checks as follows:

As you can note, the dataset has 9 columns.

Let us now check the columns and their data types.

From the dataset, 7 columns have int64 data type and 2 columns have float64 data type.
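
The original notebook cells are not reproduced here, so the checks above can only be sketched. In this minimal sketch, the file name diabetes.csv is an assumption, and a three-row in-memory sample with made-up values stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up sample rows that only mimic the column layout described above.
# With the real file you would call pd.read_csv("diabetes.csv") instead
# (the file name is an assumption).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,100,28.5,0.45,35,0\n"
    "5,160,80,0,0,33.1,0.62,50,1\n"
    "1,95,66,22,85,24.0,0.20,23,0\n"
)
df = pd.read_csv(sample_csv)

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
```

On the real dataset, df.shape and df.dtypes are the calls that would confirm the 9 columns and the int64/float64 split mentioned above.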

Data Exploration and Visualization

In this section, we will create graphs that display the distributions of the data and the relationships between variables, so that we can understand the dataset much better. This is a critical section, since it informs how the model will be built.

We will begin by creating distribution plots. To understand more about distribution plots, read my article on Distribution Plots in Python using Seaborn.

i. Checking the distribution of the target variable (Outcome).

From the plot above, the data contains more cases without diabetes (0) than those with diabetes (1).
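
The class balance behind that plot can also be checked numerically. This is a minimal sketch: the Series below is a stand-in for the Outcome column, built from the class counts given in the dataset description (268 ones out of 768 rows), rather than loaded from the file.

```python
import pandas as pd

# Stand-in for df["Outcome"], built from the counts in the dataset
# description: 268 of the 768 rows are 1, the remaining 500 are 0.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()                 # rows per class
shares = outcome.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(3))
```

Roughly a third of the records are positive, so the classes are imbalanced but not extremely so.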

ii. Checking the distribution of the predictor variables.

Here, we will use both distplot and boxplot as shown below. Let us plot each variable to show its distribution in the dataset.

Distribution of Glucose
Distribution of Pregnancies
Distribution of BMI
Distribution of BloodPressure
Distribution of SkinThickness
Distribution of Age
Distribution of Insulin
Distribution of DiabetesPedigreeFunction

iii. Checking for any missing values in the dataset.

There are no missing values in the dataset. The dataset had already been cleaned. Sometimes while working on a project, you may find a dataset with missing values. It is important to know how to handle the missing data.
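
As a minimal sketch of such a check, the snippet below counts missing values per column on a small made-up table (not the diabetes dataset) and then fills them with the column mean, which is only one of several common strategies.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing Glucose reading, for illustration only.
toy = pd.DataFrame({
    "Glucose": [120.0, np.nan, 95.0],
    "BMI": [28.5, 33.1, 24.0],
})

# Count missing (NaN) entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One simple way to handle them: replace each NaN with its column mean.
filled = toy.fillna(toy.mean())
print(filled)
```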

iv. Plotting relationships in the dataset.

There are different ways to display relationships in a dataset. You can use pair plots, joint plots, correlations, etc. We will use a pairplot to find relationships in the dataset.

Next, we will check the relationships by computing the correlations, as shown in the table below.

The table displays the pairwise correlation coefficients between the variables in the dataset. Each coefficient lies between -1 and 1: values close to 1 or -1 indicate a strong positive or negative linear relationship, while values close to 0 indicate little or no linear relationship.

We can plot the correlations using a heatmap as shown below.
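
A correlation table like the one described above can be computed with pandas. This is a minimal sketch on randomly generated stand-in columns (the column names are borrowed from the dataset, the values are not), in which Insulin is deliberately tied to Glucose while Age is independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Randomly generated stand-in data, NOT the diabetes dataset.
glucose = rng.normal(120, 30, size=200)
toy = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.8 * glucose + rng.normal(0, 20, size=200),  # tied to Glucose
    "Age": rng.integers(21, 80, size=200),                   # independent
})

# Pairwise Pearson correlation coefficients, each between -1 and 1.
corr = toy.corr()
print(corr.round(2))
```

With seaborn imported as sns, passing this table to sns.heatmap(corr, annot=True) is one way to draw it as a heatmap.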

Splitting the Data

We will now separate the features from the target before training a model. X will contain all the independent variables, while y will hold the dependent variable (Outcome).

With X and y defined, let us divide them into training and test sets using train_test_split.
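
The two steps above can be sketched as follows. This is a minimal sketch on a made-up table; the test_size and random_state values are assumptions, not necessarily the ones used in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Made-up stand-in table with two predictors and the target column.
toy = pd.DataFrame({
    "Glucose": rng.integers(60, 200, size=100),
    "BMI": rng.uniform(18, 45, size=100).round(1),
    "Outcome": rng.integers(0, 2, size=100),
})

# Step 1: separate the independent variables (X) from the target (y).
X = toy.drop(columns="Outcome")
y = toy["Outcome"]

# Step 2: hold out part of the rows for testing.
# test_size=0.33 and random_state=42 are assumed values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model is later fitted on X_train and y_train only, and X_test and y_test are kept aside to measure how well it generalizes.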

Before we build the model, let us impute the zero values in our dataset. If you check the head of the dataset, you will notice that some independent variables contain zero values. For measurements such as Glucose, BloodPressure or BMI, a zero is not physiologically plausible and most likely stands for a missing reading, which can hurt the performance of our model. We therefore need to impute the zero values using the mean of the other values in the same column. The code below shows how we can check the zero values in the dataset by printing the count for each variable.

Notice that there are a total of 768 zero values in the X dataset (independent variables). We have also printed the number of zero values for each column.
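
Counting zeros per column can be sketched like this. The four rows below are made up for illustration, so the counts differ from the ones reported for the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; zeros stand for missing measurements.
X = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# (X == 0) marks every zero cell; summing counts them per column.
zeros_per_column = (X == 0).sum()
total_zeros = int(zeros_per_column.sum())

print(zeros_per_column)
print("Total zero values:", total_zeros)
```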

We will now use the column mean to impute the zero values as shown below. This method computes the mean of the non-zero values in each column and replaces the zeros in that column with it. This makes the dataset more meaningful for machine learning.
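
Here is a minimal sketch of mean imputation with pandas, on made-up rows. It is one possible implementation, not necessarily the one in the original notebook. The means are computed on the training rows only and then reused for the test rows, which is an assumption about good practice rather than something stated above.

```python
import pandas as pd

# Made-up training and test rows; zeros stand for missing measurements.
X_train = pd.DataFrame({"Glucose": [148, 0, 183, 89],
                        "Insulin": [0, 0, 94, 168]})
X_test = pd.DataFrame({"Glucose": [0, 100],
                       "Insulin": [50, 0]})

# Turn zeros into NaN so they are ignored when the mean is computed.
train_nan = X_train.mask(X_train == 0)
test_nan = X_test.mask(X_test == 0)

# Column means of the non-zero training values.
train_means = train_nan.mean()

# Fill the gaps in both sets with the training means.
X_train_imputed = train_nan.fillna(train_means)
X_test_imputed = test_nan.fillna(train_means)

print(train_means)
print(X_train_imputed)
print(X_test_imputed)
```

Note that a zero in Pregnancies can be a genuine value, so which columns to impute is a judgment call.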

Building the Model

As I stated earlier, we will use four models, namely Random Forests, Decision Trees, XGBoost and Support Vector Machines, and compare their accuracy scores. Accuracy is the metric used to evaluate the models: it is the number of correctly predicted instances divided by the total number of instances in the dataset. Other metrics can also be explored to determine the best model.
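
To make the accuracy definition concrete, here is a small sketch that computes it by hand and with scikit-learn's accuracy_score. The labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions; positions 2 and 5 are wrong.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Accuracy by hand: correct predictions divided by all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The same value from scikit-learn.
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)  # 0.75 0.75
```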

a. Random Forests

Here is the accuracy score:

Random Forest gives an accuracy_score of 0.7598.

b. Decision Trees

As shown above, Decision Tree gives an accuracy_score of 0.7283.

c. XGBoost

XGBoost seems to be doing well with an accuracy score of 0.7795.

d. Support Vector Machine (SVM)

Support Vector Machine (SVM) gives an accuracy score of 0.6378.

Model Selection

Based on the accuracy score, the best model for this project is XGBoost, which gives an accuracy score of 0.7795 (about 78%). There are other ways of determining the best model that you can explore and use for your own models. For this basic introduction to machine learning, I decided to use the accuracy score as the main metric in choosing the best model.
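
The comparison across models can be sketched as a loop. This is a minimal sketch under several assumptions: the data is synthetic (generated with make_classification, not the diabetes dataset), all models use default hyperparameters, and XGBoost is included only if the xgboost package is installed. The scores it prints will therefore differ from the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (768 rows, 8 features); NOT the diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0  # assumed split settings
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
try:
    # Optional dependency: only added when xgboost is available.
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=0)
except ImportError:
    pass

# Fit each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Best model by accuracy:", best_model)
```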

Feature Importances

As data scientists, we often focus on optimizing model performance for our projects. However, it is also important to understand how the features (variables) in our model contribute to its predictions. We will now look at the features that are most important when we use the XGBoost model for diabetes prediction.

The graph shows that the most important feature in this diabetes prediction is Glucose followed by BMI.
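
Here is a minimal sketch of how feature importances can be read from a fitted tree-based model. To keep it runnable with scikit-learn alone, it swaps in GradientBoostingClassifier as a stand-in for XGBoost and uses synthetic data in which Glucose is deliberately made the main driver of the label, so the ranking it prints is a property of this toy data, not a finding about diabetes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in features; the column names are borrowed for illustration.
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=500),
    "BMI": rng.normal(32, 7, size=500),
    "Age": rng.integers(21, 80, size=500),
})
# Toy label driven mostly by Glucose, slightly by BMI, plus noise.
signal = X["Glucose"] + 0.5 * X["BMI"] + rng.normal(0, 10, size=500)
y = (signal > 140).astype(int)

# Stand-in for the XGBoost model (a swapped-in scikit-learn estimator).
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# One importance value per feature, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

A fitted XGBClassifier exposes the same feature_importances_ attribute, and the xgboost package also provides plot_importance for drawing the chart.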


Finally, let us use the XGBoost model to predict whether or not a patient has diabetes (1 or 0). The following are the predicted probabilities of the absence and the presence of diabetes, respectively.

We can see that the patient at index 0 has a 98.1% chance of absence of diabetes, while the patient at index 1 has a 91.8% predicted chance of having diabetes.
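
Per-patient probabilities like these come from a classifier's predict_proba method. This is a minimal sketch on synthetic data, again with scikit-learn's GradientBoostingClassifier swapped in as a stand-in for XGBoost, so the numbers it prints are unrelated to the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; NOT the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train = X[:250], y[:250]
X_new = X[250:]

# Stand-in for the XGBoost model (a swapped-in scikit-learn estimator).
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# One row per sample; columns follow model.classes_, here [0, 1]:
# column 0 = probability of class 0, column 1 = probability of class 1.
proba = model.predict_proba(X_new)
labels = model.predict(X_new)

print(model.classes_)
print(proba[:2].round(3))   # probabilities for the first two samples
print(labels[:2])           # their predicted classes
```

Each row sums to 1, so a high value in the first column means the model considers the absence of the condition likely for that sample.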

Wrapping up

This was a very interesting project. I hope it was helpful in understanding the basic concepts of machine learning in diabetes prediction. Feel free to delve deeper and reach higher accuracy by tuning your model. Let me know how this works for you.

Here is the full code on my GitHub.