Exploratory Data Analysis in Python – (for beginners)
Exploratory data analysis (EDA) is the process of getting to know a dataset, including its structure, its contents and its quirks, by running some simple analyses on it. EDA is a very important step in data science. In this article, I will explain, step by step, how to do exploratory data analysis in Python. I will use Python 3 and a Jupyter notebook to run the code and display the results.
I will use a loan dataset that I downloaded from Kaggle. Kaggle is an online community of data scientists and machine learners that provides users with datasets. I will use this dataset to explain some concepts in exploratory data analysis. I will also use the pandas library, a Python library for data manipulation and analysis.
So let’s begin. Download the dataset here and save it in the directory you are working in.
Reading the dataset
To read the dataset, let us start by importing pandas.
The pandas library allows you to import (read) your dataset and then manipulate and analyze it as you want.
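A minimal sketch of the import step. The alias pd is only a widely used convention, not a requirement, and printing the version is just a quick way to confirm that pandas is installed.

```python
import pandas as pd  # "pd" is the conventional short alias

# Printing the installed version confirms that the import worked.
print(pd.__version__)
```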
We can now read our dataset as a dataframe called loan and check the first 5 rows using .head().
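Here is a sketch of what this step could look like. The name of the downloaded file is not given in this article, so the example reads a few made-up rows from an in-memory string instead. The rows and the ID values are purely illustrative, and only the column names come from the loan dataset. With the real file, you would pass its path to pd.read_csv.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded file (illustrative values only).
csv_text = """Loan_ID,Gender,Education,ApplicantIncome,LoanAmount,Property_Area
ID001,Male,Graduate,5000,120,Urban
ID002,Female,Not Graduate,3000,,Rural
ID003,Male,Graduate,4000,100,Semiurban
ID004,,Graduate,6000,150,Semiurban
ID005,Female,Graduate,2500,80,Urban
ID006,Male,Not Graduate,3500,110,Semiurban
"""

# With the real file: loan = pd.read_csv("<path to your downloaded CSV>")
loan = pd.read_csv(io.StringIO(csv_text))

first_rows = loan.head()  # returns the first 5 rows by default
print(first_rows)
```

In a Jupyter notebook you can simply end the cell with loan.head() and the table is rendered for you.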
We can also check the last five rows of the dataset by using .tail().
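As a small illustration, .tail() returns the last rows of a dataframe. The dataframe below is a made-up stand-in for loan, so the values are not from the real dataset.

```python
import pandas as pd

# Small made-up frame; the real loan dataframe behaves the same way.
loan = pd.DataFrame({
    "Loan_ID": [f"ID{i:03d}" for i in range(1, 8)],
    "ApplicantIncome": [5000, 3000, 4000, 6000, 2500, 3500, 4500],
})

last_rows = loan.tail()   # last 5 rows by default
last_two = loan.tail(2)   # pass a number to change how many rows come back
print(last_rows)
```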
Now that we have successfully read our dataset and checked the first and the last 5 rows of the data, let’s proceed and check the columns separately.
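The column names are available through the .columns attribute. The sketch below uses a made-up three-column frame in place of loan.

```python
import pandas as pd

# Made-up stand-in for the loan dataframe (illustrative values only).
loan = pd.DataFrame({
    "Loan_ID": ["ID001", "ID002"],
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5000, 3000],
})

print(loan.columns)           # an Index object holding the column names
print(loan.columns.tolist())  # the same names as a plain Python list
```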
This gives us the column names of our dataset as a pandas Index, which is an array-like object.
Next, we will use .info() to check the number of entries in our dataset.
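A sketch of the call on a tiny made-up frame, so the counts it prints are much smaller than those of the real dataset. The isna().sum() line is an extra that is not part of the original walkthrough. It computes directly the missing-value counts that you would otherwise work out from the .info() output.

```python
import pandas as pd

# Made-up stand-in for loan, with one missing Gender value.
loan = pd.DataFrame({
    "Loan_ID": ["ID001", "ID002", "ID003", "ID004"],
    "Gender": ["Male", None, "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
})

loan.info()  # prints the entry count, columns, non-null counts and dtypes

# Extra: count the missing values per column directly.
missing_per_column = loan.isna().sum()
print(missing_per_column)
```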
In this dataset, we have 614 entries and 13 columns. The function has also given us the data type of each column. For example, the Loan_ID column is an object type while ApplicantIncome is an integer. It also shows the number of non-missing values in each column. For example, the Loan_ID column has no missing values, while Gender has 601 non-missing values, which implies that 13 values (614 - 601) are missing in that column. This is a very useful first step in understanding what our dataset looks like.
Exploratory Data Analysis
Let us explore the data more by checking the basic statistics of the dataframe using .describe().
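A minimal sketch of the call, again on a made-up frame rather than the real loan data.

```python
import pandas as pd

# Made-up stand-in for loan: one text column and two numeric columns.
loan = pd.DataFrame({
    "Loan_ID": ["ID001", "ID002", "ID003", "ID004"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "LoanAmount": [120.0, None, 100.0, 150.0],
})

summary = loan.describe()  # one column of statistics per numeric column
print(summary)
```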
When called on the entire dataframe, this function by default gives basic statistics for the numeric columns only. These include count (the number of non-null entries in that column), mean, standard deviation, minimum and maximum values, and the percentiles. This function can also be used on string columns, where it returns the count of non-null entries, the number of unique entries, the most frequently occurring entry (‘top’) and the number of times that entry occurs (‘freq’).
Let us do this by using the Property_Area column.
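A sketch of the call on a single text column. The six rows below are made up, so the counts are smaller than those you will get from the real dataset.

```python
import pandas as pd

# Made-up stand-in for the Property_Area column of loan.
loan = pd.DataFrame({
    "Property_Area": ["Urban", "Semiurban", "Rural", "Semiurban", "Urban", "Semiurban"],
})

area_summary = loan["Property_Area"].describe()
print(area_summary)  # shows count, unique, top and freq
```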
We can see that there are 614 entries in that column with 3 unique entries. The most frequent entry is Semiurban and it appears 233 times. You can play around with it in your dataset by calling it on any other column to see the basic statistics.
Note that the .describe() function gives us a summary of the basic statistics for the columns. Assume you want to get the basic statistics one by one. We will use specific functions to get individual statistics. Let’s use the LoanAmount column to explore this.
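The sketch below shows one way to do this. The particular statistics chosen here are an assumption about what is useful, and the LoanAmount values are made up.

```python
import pandas as pd

# Made-up stand-in for the LoanAmount column, with one missing value.
loan = pd.DataFrame({"LoanAmount": [120.0, None, 100.0, 150.0, 80.0]})

amount = loan["LoanAmount"]
print("count :", amount.count())   # non-missing values only
print("mean  :", amount.mean())    # missing values are skipped by default
print("median:", amount.median())
print("std   :", amount.std())     # sample standard deviation
print("min   :", amount.min())
print("max   :", amount.max())
```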
Now assume that we want to know the number of unique entries in a column, for instance the Education column. Here, we will use the .nunique() function.
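A minimal sketch with a made-up Education column:

```python
import pandas as pd

# Made-up stand-in for the Education column of loan.
loan = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

n_levels = loan["Education"].nunique()  # number of distinct non-missing values
print(n_levels)
```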
The result shows that the Education column in our dataset has 2 unique entries. To find out what those 2 entries are, we can use the .unique() function.
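The same made-up column can be used to sketch .unique(), which returns the distinct values in order of first appearance:

```python
import pandas as pd

# Made-up stand-in for the Education column of loan.
loan = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

levels = loan["Education"].unique()  # distinct values, in order of appearance
print(levels)
```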
It shows that the unique entries in the Education column are Graduate and Not Graduate. You can use this function on any column to list its unique entries.
Great! I hope this is useful. In the next article, I will cover more advanced exploratory data analysis in Python.