Exploratory data analysis (EDA) is the process of analyzing datasets using visualizations and basic summary statistics to understand the relationships, distributions, etc. of data variables. It is generally the first step performed with a new dataset to get insights about the data. Doing EDA manually, where we create the visualizations and statistics ourselves, can sometimes result in mistakes. It can also take a lot of time that could otherwise be spent on other, more important tasks.
Sweetviz is a very useful Python library that generates the EDA of a given dataset with just 2 lines of code. It produces a standalone HTML report with interactive visualizations of the dataset. This saves time that would otherwise be spent doing EDA manually, and it also saves us from mistakes that we could introduce when doing things ourselves.
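Sweetviz lets us perform the following analyses: EDA of a single dataset, target variable analysis, comparison of two datasets, and comparison of two subsets of the same dataset.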
We'll now start explaining how to use sweetviz with examples.
We have imported the necessary libraries for our purpose. We'll be using various datasets available from scikit-learn for explanation purposes.
import pandas as pd
import sweetviz
print("SweetViz Version : {}".format(sweetviz.__version__))
Below we have loaded 3 datasets available from scikit-learn which we'll be using in our examples. We have loaded each dataset as a pandas dataframe and displayed the first few lines for each to give an idea about the contents of the datasets.
from sklearn import datasets
from sklearn.model_selection import train_test_split
wine = datasets.load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
#wine_df["WineType"] = [wine.target_names[typ] for typ in wine.target]
wine_df["WineType"] = wine.target
wine_df.head()
diabetes = datasets.load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df["Progression"] = diabetes.target
diabetes_df.head()
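# NOTE: load_boston() was removed in scikit-learn 1.2, so this call needs an older scikit-learn version.
boston = datasets.load_boston()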
boston_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_df["Price"] = boston.target
boston_df.head()
As a part of this section, we'll explain how to perform EDA using the datasets loaded above. Sweetviz provides 3 methods primarily for performing exploratory data analysis: analyze() generates a report for a single dataframe, compare() generates a report comparing two dataframes, and compare_intra() splits one dataframe into two parts based on a boolean condition and compares them.
Please make a NOTE that each of the above-mentioned methods returns an instance of DataframeReport. This instance has two important methods, show_html() and show_notebook(), which let us show the interactive EDA report either as an independent HTML page or inside of a Jupyter notebook.
Each of the above-mentioned methods shows a progress bar while it is generating the EDA.
Below we have generated an EDA report for the wine dataset using analyze() method. It returns an instance of type DataframeReport. We can then use it to display reports.
report = sweetviz.analyze(wine_df)
report
Below we have called the show_html() method on the DataframeReport object. It'll open the HTML report in a browser.
report.show_html()
We'll now explain individual parts of the report. The dashboard generated by all three methods (analyze(), compare(), and compare_intra()) is the same, with a few more details present depending on the method called.
The summary section gives summary stats about the dataset like the number of samples, number of features, duplicates, RAM usage, categorical feature count, numerical feature count, and text feature count. It'll show counts for two datasets if we have called the compare() or compare_intra() method. In this case, it'll show summary stats about our whole wine dataset. We are also provided with a button named Associations in this section. If we click on that button, it'll generate a heatmap showing the associations between the features of the dataset.
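To make the summary numbers concrete, the sketch below computes similar figures by hand with pandas on a tiny made-up dataframe. It only illustrates what the counts mean and is not sweetviz's implementation. The dataframe `toy_df` and the helper `summarize()` are hypothetical names, and sweetviz's own rules for deciding whether a column is categorical, numerical, or text may differ from the simple dtype check used here.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
# Rows 0 and 2 are identical on purpose, to produce one duplicate.
toy_df = pd.DataFrame({
    "alcohol": [13.2, 12.4, 13.2, 14.1],
    "ash": [2.1, 2.4, 2.1, 2.3],
    "label": ["a", "b", "a", "c"],
})

def summarize(df):
    """Return dataset-level counts similar to those in a report summary."""
    numeric_cols = df.select_dtypes(include="number").columns
    return {
        "rows": len(df),
        "features": df.shape[1],
        "duplicates": int(df.duplicated().sum()),
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "numeric_features": len(numeric_cols),
        "other_features": df.shape[1] - len(numeric_cols),
    }

summary = summarize(toy_df)
print(summary)
```

Running the same helper on a train and a test dataframe gives a rough manual version of the two-column summary that a comparison report shows.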
When we click on the Associations button in the summary section, the heatmap appears on the right-hand side of the screen. Each tile of the heatmap holds either a square or a circle. Circles represent the Pearson correlation between two numerical features, in the range [-1, 1]. Squares represent categorical associations, in the range [0, 1], and are shown for pairs of categorical features and for numerical-categorical pairs. Categorical associations are read row-wise: they show how much association the feature named on the left of the row has with each of the other features. The diagonal of the chart is left blank as each feature has a total relationship with itself. In our example, the WineType feature is categorical, hence the row and column representing WineType have squares, whereas all other cells have circles because all other features are numerical.
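The two kinds of numbers in the heatmap can be illustrated by hand. The sketch below computes a Pearson correlation for two numerical columns and a correlation ratio for a categorical column against a numerical one. The correlation ratio is used here as one common categorical-numerical association measure with values in [0, 1]; treating it as the exact measure behind the squares is an assumption. The data and the function names `pearson()` and `correlation_ratio()` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences, in [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_ratio(categories, values):
    """Correlation ratio (eta) of a numeric column given a categorical one, in [0, 1]."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for cat, val in zip(categories, values):
        groups.setdefault(cat, []).append(val)
    # Variation explained by the group means, relative to the total variation.
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values())
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total)

# Hypothetical values: a class label and two numeric measurements.
wine_type = [0, 0, 1, 1, 2, 2]
alcohol = [14.0, 14.2, 12.0, 12.2, 13.0, 13.2]
ash = [2.4, 2.5, 2.0, 2.1, 2.2, 2.3]

r = pearson(alcohol, ash)                    # numerical vs. numerical (a circle)
eta = correlation_ratio(wine_type, alcohol)  # categorical vs. numerical (a square)
print(round(r, 3), round(eta, 3))
```

Note that Pearson correlation is signed and symmetric, while the correlation ratio has no sign and depends on which column is treated as the category.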
Below the summary section, there is a tab for each feature of our dataset. There is also a tab for the target variable if we have provided a column name to be treated as the target. Each tab has basic stats about the feature like total values, missing count, min, max, median, average, quantiles, range, standard deviation, etc. It also has a histogram showing the distribution of the feature's data. We can click on a tab and it'll open one more tab on the right-hand side showing more details about the feature. If we have provided a target variable name, then its tab appears first and is colored black to differentiate it from the other columns.
The tab that gets displayed when we click on a feature tab has information like the actual values of the numerical and categorical associations of the feature with all other features, the most frequent values, the largest values, and the smallest values. It also shows the histogram of the feature's data distribution again.
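The value lists in the detail tab correspond to simple pandas operations. The sketch below shows one way to get the most frequent, largest, and smallest values of a column by hand. The series `ash` holds made-up values, and showing three entries per list is an arbitrary choice for this illustration.

```python
import pandas as pd

# Hypothetical column of values.
ash = pd.Series([2.1, 2.4, 2.1, 2.3, 2.7, 2.1, 1.9, 2.4], name="ash")

most_frequent = ash.value_counts().head(3)  # value -> count, most common first
largest = ash.nlargest(3)                   # three largest values
smallest = ash.nsmallest(3)                 # three smallest values

print(most_frequent)
print(largest.tolist())
print(smallest.tolist())
```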
Sweetviz also lets us show reports inside of a Jupyter notebook using the show_notebook() method of DataframeReport. Below we have displayed the report inside of the notebook. We have passed h=1500 to set the height of the displayed report to 1500 pixels.
report.show_notebook(h=1500)
As a part of this section, we'll explain how we can use sweetviz to perform target variable analysis which can be useful to see the relationship between the target variable and all features of the dataset. We can do so by just providing a column name from the dataframe that we want to use as the target variable in analyze() method.
Below we have generated a report from our diabetes dataframe using analyze() method. We have instructed the method to use Progression column as the target variable.
report = sweetviz.analyze([diabetes_df, "Diabetes Dataset"], target_feat="Progression")
After generating the report, we have called the show_html() method on the DataframeReport object to open it in a browser.
report.show_html()
We can notice from the output how the target variable tab is highlighted in black to differentiate it from the other columns.
Apart from this, the target variable values are also plotted as a line inside the histogram of each feature. This can be helpful for understanding how the target variable changes with the feature's values. The value of the target variable is read from the Y-axis drawn on the right. When we click on the tab of any feature, we also see that the association of that feature with the target variable is highlighted in black.
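One way to read the line drawn over each histogram is as the average target value within each bin of the feature. The sketch below reproduces that idea with pandas on made-up numbers. The column values, the bin count of 4, and the use of a per-bin mean are assumptions for illustration, not a description of sweetviz internals.

```python
import pandas as pd

# Hypothetical feature and target values.
df = pd.DataFrame({
    "bmi": [18.0, 19.5, 22.0, 24.5, 27.0, 29.5, 31.0, 33.5],
    "Progression": [60, 70, 90, 110, 150, 170, 220, 240],
})

# Split the feature into 4 equal-width bins, as a histogram would.
bins = pd.cut(df["bmi"], bins=4)

# For each bin: number of rows (bar height) and mean target (line height).
per_bin = df.groupby(bins, observed=False)["Progression"].agg(["count", "mean"])
print(per_bin)
```

A mean that rises steadily from the first bin to the last, as it does for these made-up numbers, is the pattern a rising target line in the report would indicate.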
Below we have generated another example of target variable analysis, but this time we have used the wine dataset. Using the FeatureConfig constructor, we have skipped the columns proline and magnesium from the original dataset and instructed sweetviz to treat the WineType column as numerical. This constructor lets us provide configuration for the features of the dataset: we can explicitly list the features that we want to exclude from the report and the features that we want to be treated as categorical, numerical, or text. The skip parameter accepts a list of column names to leave out of the report. The force_cat, force_text, and force_num parameters accept lists of column names that we want to be treated as categorical, text, and numerical respectively.
config = sweetviz.FeatureConfig(skip=["proline", "magnesium"], force_num=['WineType'])
report = sweetviz.analyze(source=wine_df, feat_cfg=config, target_feat="WineType")
Below we have displayed the report generated for the wine dataset by providing WineType as the target variable.
report.show_html()
Below we have created another example showing usage of the analyze() method. We have generated a report for the wine dataset, but this time we have set pairwise_analysis to "off" so that pairwise relationships between features are not computed. The resulting report will not include the association details that were present in all reports so far.
report = sweetviz.analyze(source=wine_df, pairwise_analysis="off")
report.show_html()
As a part of this section, we'll explain how we can compare two datasets and generate EDA for both. This will help us better understand how the distribution of data differs between two datasets. We can compare, for example, train/test, train/validation, or validation/test datasets.
Sweetviz lets us generate EDA for two datasets using the compare() method. It'll show the EDA for both datasets next to each other.
We have first divided our diabetes dataset into train (80%) and test (20%) sets using scikit-learn's train_test_split() function. We'll be comparing these two datasets.
train_df, test_df = train_test_split(diabetes_df, train_size=0.8)
train_df.shape, test_df.shape
Below we have generated an EDA report comparing train and test datasets generated from the diabetes dataset. We have informed the method to use Progression column as the target variable. We have then called show_html() on the report to open it in a new window of the browser.
report = sweetviz.compare(source=train_df, compare=test_df, target_feat="Progression")
report.show_html()
We can notice above from the full EDA report image that it shows details for both datasets.
Below we have included images for individual sections as well to give an idea about report sections.
We can notice that the summary section now has summary details for both datasets. There are two Associations buttons, one per dataset, which show the associations heatmap for the corresponding dataset when clicked.
Individual column stats now include statistics for both datasets given as input, highlighted in different colors. The distribution histogram is also drawn for both datasets in a single chart with different colors. There are two lines in the histogram, one per dataset, based on the target variable values in each dataset.
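The side-by-side statistics can be approximated by hand when a quick check is enough. The sketch below places `describe()` output for two dataframes next to each other with `pd.concat`. The data and the names `train_toy` and `test_toy` are made up, and this is not how sweetviz builds its report.

```python
import pandas as pd

# Hypothetical stand-ins for a train and a test split.
train_toy = pd.DataFrame({"age": [25, 32, 47, 51], "bmi": [21.0, 24.5, 30.1, 27.3]})
test_toy = pd.DataFrame({"age": [29, 44], "bmi": [22.8, 28.6]})

# Put the two summaries next to each other under the labels "train" and "test".
side_by_side = pd.concat(
    [train_toy.describe(), test_toy.describe()],
    axis=1,
    keys=["train", "test"],
)

# Look at one feature at a time, train next to test.
age_stats = side_by_side.xs("age", axis=1, level=1)
print(age_stats)
```

Large gaps between the two columns for the same statistic are a hint that the split is not representative, which is the kind of difference the comparison report makes visible.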
Below we have again generated a report using both datasets but this time we have given names for both datasets when generating a report using compare() method. Please check the report screenshot below to check the names appearing in the summary section.
report = sweetviz.compare(source=[train_df, "Train Set"], compare=[test_df, "Validation Set"],
                          target_feat="Progression")
report.show_html()
There are situations when we need to understand the data distribution based on some boolean column of the dataset. For example, we might want to compare the EDA for all rows where gender is male vs. all rows where gender is female. We can do this kind of comparison using the compare_intra() method. It'll generate results that are almost identical to those generated by the compare() method.
We'll be using our Boston housing dataset to generate the report using the compare_intra() method. We have converted the values of column CHAS to booleans and passed them to the condition_series parameter to tell the method how to divide the dataset. The CHAS variable of the Boston housing dataset indicates whether houses bound a river or not. The compare_intra() method will divide the Boston dataset into two datasets based on these boolean values and generate an EDA report comparing the two.
We have also included a screenshot of the report generated below by calling show_html() method on the report.
report = sweetviz.compare_intra(source_df=boston_df,
                                condition_series=boston_df["CHAS"].astype(bool),
                                names=["Bounds River", "Doesn't Bound River"])
report.show_html()
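Conceptually, the split described for compare_intra() can be pictured with plain pandas boolean indexing: rows where the condition is True form one dataset and the remaining rows form the other. The sketch below uses a made-up dataframe with a CHAS-like 0/1 column. It illustrates the idea only and makes no claim about how sweetviz implements it.

```python
import pandas as pd

# Hypothetical data with a 0/1 indicator column.
toy_df = pd.DataFrame({
    "CHAS": [0, 1, 0, 0, 1, 0],
    "Price": [21.0, 33.4, 18.9, 24.3, 29.8, 20.1],
})

# Boolean series: True where the indicator is 1.
condition = toy_df["CHAS"].astype(bool)

bounds_river = toy_df[condition]       # rows where the condition holds
not_bounds_river = toy_df[~condition]  # all remaining rows

print(len(bounds_river), len(not_bounds_river))
print(bounds_river["Price"].mean(), not_bounds_river["Price"].mean())
```

Any boolean series aligned with the dataframe can serve as the condition, for example a comparison such as `toy_df["Price"] > 25`.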
All three methods that we explained earlier generate a report and return an instance of type DataframeReport, which we can display by calling the show_html() method on it. We can also directly create an instance of DataframeReport with datasets and it'll work just fine.
Below we have created a report of the wine dataset by creating an instance of DataframeReport using constructor. We can then call show_html() method on it and it'll open the report in a new browser tab.
report = sweetviz.DataframeReport(source=wine_df)
Below we have explained another example where we provide train and test sets generated earlier from the diabetes dataset.
report = sweetviz.DataframeReport(source=train_df, compare=test_df, target_feature_name="Progression")
This ends our small tutorial explaining how to use the sweetviz library.