Visualizing Distribution Plots in Python Using Seaborn
Data visualization is the graphical presentation of data, which makes its patterns and correlations easier to understand. It is a very important step in data science, so every data scientist or analyst should be able to uncover the story behind the data through visualization. In any machine learning project, a solid understanding of the data through EDA (Exploratory Data Analysis) makes the modeling work a lot easier.
Python offers several graphing libraries with lots of features. In this article, we will learn data visualization techniques in Python using seaborn.
According to the official seaborn page:
"Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics."
We will use data from seaborn's built-in datasets. This article focuses on the syntax rather than on interpreting the graphs. We will learn how to create the following distribution plots: dist plot, joint plot, pair plot, rug plot and kde plot.
These plots show the distribution of the dataset. This is the first part of my series on data visualization in Python using seaborn. In the next articles, we will delve into more complex visualizations using seaborn.
Now, let us start by importing seaborn and the dataset. We will use the planets dataset that ships with seaborn.
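A minimal sketch of the import and loading step. It assumes an internet connection the first time it runs, because sns.load_dataset( ) downloads the example data and caches it locally. The variable name planets matches the dataframe name used later in this article.

```python
import seaborn as sns

# load_dataset fetches the example CSV from seaborn's online data repository
# (it is cached locally afterwards) and returns a pandas DataFrame.
planets = sns.load_dataset("planets")

# Peek at the first rows to confirm the load worked.
print(planets.head())
```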
From there, let us do a few more checks on the dataset before we start visualizing. This step is important because it gives us a glimpse into the dataset before the actual visualization. For more explanation on this, read my previous article on Exploratory Data Analysis in Python for Beginners. We will use the .shape and .columns attributes and the .info( ) method to get the following results.
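The three checks can be sketched as follows. So that the snippet runs without a download, it uses a tiny made-up frame that only borrows the six column names of the planets dataset; the values are invented for illustration. With the real data, you would build planets with sns.load_dataset("planets") instead.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the planets dataset (same column names, invented values).
planets = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Imaging", "Radial Velocity", "Transit"],
    "number": [1, 2, 1, 1, 3],
    "orbital_period": [269.3, 3.5, np.nan, 874.8, 12.1],
    "mass": [7.1, np.nan, np.nan, 2.2, 0.9],
    "distance": [77.4, 150.0, 45.5, 56.9, np.nan],
    "year": [2006, 2011, 2008, 2009, 2013],
})

print(planets.shape)    # (number of rows, number of columns)
print(planets.columns)  # the column labels
planets.info()          # dtype and non-null count per column
```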
The dataset has 6 columns and 1,035 rows. It also contains both categorical and numerical columns. We will also check whether there are any missing values in the dataset.
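One common way to run the missing-value check is to chain .isnull( ) and .sum( ). The sketch below again uses a made-up stand-in frame (invented values, real column names), so the counts it prints are not those of the actual planets dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the planets dataset (same column names, invented values).
planets = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Imaging", "Radial Velocity", "Transit"],
    "number": [1, 2, 1, 1, 3],
    "orbital_period": [269.3, 3.5, np.nan, 874.8, 12.1],
    "mass": [7.1, np.nan, np.nan, 2.2, 0.9],
    "distance": [77.4, 150.0, 45.5, 56.9, np.nan],
    "year": [2006, 2011, 2008, 2009, 2013],
})

# isnull() marks every missing cell as True; sum() then counts them per column.
missing_per_column = planets.isnull().sum()
print(missing_per_column)
```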
It is evident that there are missing values in the orbital_period, mass and distance columns.
Next, let us remove the rows with missing values. Kindly note that when dealing with real-world data, it is important to decide how you will handle the missing values. Deleting them can sometimes be very costly in terms of its impact on the model you plan to create. If you want to understand more about handling missing values, please read my previous article on 4 Techniques for Treating Missing Values in your Data.
We can remove the missing values with the dataframe's .dropna( ) method.
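A minimal sketch of this step, assuming the rows are dropped with .dropna( ). By default it removes every row that contains at least one missing value. The frame is a made-up stand-in, and the name planets_clean is only used here to keep both versions visible.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the planets dataset (same column names, invented values).
planets = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Imaging", "Radial Velocity", "Transit"],
    "number": [1, 2, 1, 1, 3],
    "orbital_period": [269.3, 3.5, np.nan, 874.8, 12.1],
    "mass": [7.1, np.nan, np.nan, 2.2, 0.9],
    "distance": [77.4, 150.0, 45.5, 56.9, np.nan],
    "year": [2006, 2011, 2008, 2009, 2013],
})

# Drop every row that has a missing value in any column.
planets_clean = planets.dropna()

print(len(planets), "rows before,", len(planets_clean), "rows after")
```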
Our dataset now has only 498 rows. Let us use it to create the plots.
1. Dist plot
First, we will create dist plots. A dist plot shows the distribution of a univariate set of observations. Let us plot the distribution of the mass column using distplot. The syntax is quite simple: call sns.distplot( ) and pass the column we want to plot, for example sns.distplot(planets['mass']). Note that distplot has been deprecated since seaborn 0.11, and newer code should use sns.histplot( ) or sns.displot( ) instead.
We can remove the kde layer (the line on the plot) and keep only the histogram by passing kde=False.
2. Joint plot
After that, we will create a joint plot. A joint plot is used to plot bivariate data, and the kind parameter selects the type of plot we need, for example 'scatter', 'hex', 'kde' or 'reg'. The general syntax requires us to specify the x and y variables, the data we want to use and the kind of plot we need. Let us plot the year column against the distance column using kind='scatter'.
Let us repeat the joint plot using kind='hex'.
We can note the difference between the two plots. Let us do the same using kind='reg' and kind='kde'.
3. Pair plot
The third distribution plot is the pair plot. A pair plot draws pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). Let us plot the whole planets dataframe. The general syntax requires only the data to be used. However, you can also specify the hue and palette, which is especially useful when dealing with categorical data.
When dealing with categorical data, we can pass the categorical column as hue and choose the palette (color scheme), for example sns.pairplot(planets, hue='method', palette='coolwarm').
4. Rug plot
The fourth one is the rug plot. A rug plot is a plot of data for a single quantitative variable, displayed as marks along an axis.
5. Kde plot
Last but not least, we will create a kde plot. Kde plots are Kernel Density Estimation plots. A KDE plot replaces every single observation with a Gaussian (Normal) distribution centered around that value and then sums these curves to obtain a smooth estimate of the density.