Machine Learning using Pandas Profiling and Scikit-learn Pipeline
February 28, 2022
- Machine Learning
Pandas profiling is a Python library that performs an automated Exploratory Data Analysis. It automatically generates a dataset profile report that gives valuable insights. For example, we can know which variables to use and which ones we can drop using the profile report.
A machine learning pipeline is used to automate the machine learning development stages. These stages are dataset ingestion, dataset preprocessing, feature engineering, model training, model evaluation, making predictions, and model deployment.
A machine learning pipeline is made of multiple initialized steps. It uses the steps to automate the machine learning development stages. The steps are initialized in sequential order so that one’s output is used as an input for the next. Therefore, the pipeline steps need to be well-organized for faster model implementation.
Many libraries support the implementation of a machine learning pipeline. We will focus on the Scikit-Learn library. The library provides a Pipeline class that automates machine learning. We will build a customer churn model using Pandas Profiling and Scikit-learn Pipeline.
Table of contents
- How the Scikit-learn Pipeline works
- Benefits of using Scikit-learn Pipeline
- Dataset used
- Automated Exploratory Data Analysis with Pandas Profiling
- Missing values
- Dataset splitting
- Importing transformer methods and classes
- Drop columns transformer
- Numeric Transformers
- Categorical transformer
- Combining the initialized transformers
- Applying the transformers
- Adding the final estimator
- Fitting the pipeline
- Getting accuracy score on the training set
- Getting accuracy score on the testing set
To follow along with this article, a reader should:
- Know how to build a machine learning model.
- Know how to implement Scikit-learn algorithms.
- Understand machine learning workflows.
- Understand steps in dataset preprocessing.
How the Scikit-learn Pipeline works
Scikit-learn Pipeline is a powerful tool that automates the machine learning development stages. It chains a sequence of transformation methods with a final model estimator, and executes them as a single process to produce a final model.
The Scikit-learn Pipeline steps fall into two categories: transformers and estimators.
Transformers contain all the Scikit-learn methods and classes that perform data transformation. Data transformation is an important stage in machine learning.
It converts the raw dataset into a format that the model can understand and easily use. In addition, data transformation performs feature engineering and dataset preprocessing.
Feature engineering extracts relevant attributes, called features, from the dataset. The model then uses the features as input during training. Dataset preprocessing involves cleaning, formatting, and removing noise from the dataset.
Some of the most common activities involved in dataset preprocessing are as follows:
- Removing outliers: Outliers are data points that deviate from the other observations in the dataset. Removing them ensures we have data points that conform to the expected behaviour of the dataset.
- Imputing missing values: Dataset imputation replaces missing values in a dataset with some generated values. It ensures that we have a complete dataset before feeding it to the model.
- Dataset standardization: Dataset standardization rescales each numeric variable so that it has a mean of 0 and a variance of 1. This differs from min-max normalization, which scales values to fit within a fixed range such as 0 to 1. For a better understanding of dataset standardization, you could read this article.
- Handling categorical variables: Here we convert categorical data into numeric values. One-hot encoding is one of the methods that perform this process.
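The difference between standardization and range scaling can be sketched with Scikit-learn's `StandardScaler` and `MinMaxScaler`. This is a minimal illustration; the `tenure` values are made up and are not taken from the churn dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical tenure values (in months) for five customers.
tenure = np.array([[1.0], [12.0], [24.0], [48.0], [72.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized = StandardScaler().fit_transform(tenure)

# Min-max normalization: rescale the column into the range [0, 1].
normalized = MinMaxScaler().fit_transform(tenure)

print("standardized mean:", round(float(standardized.mean()), 6))
print("standardized std: ", round(float(standardized.std()), 6))
print("normalized range: ", normalized.min(), "to", normalized.max())
```

After standardization the column has a mean of 0 and a standard deviation of 1, while the min-max version is confined to the 0 to 1 range.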
Estimators take the processed dataset as an input and fit the model to it. The trained model is then used to make predictions.
Estimators are the Scikit-learn algorithms that perform classification, regression, and clustering. Common estimators are Logistic Regression, Decision Tree Classifier, K-Nearest Neighbors (K-NN), Naive Bayes, and Random Forest Classifier.
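To make the transformer-then-estimator idea concrete, the sketch below compares chaining the two steps by hand with wrapping them in a `Pipeline`. The toy data and the step names `scale` and `model` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature and a binary target.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Manual version: fit the transformer, transform the data, then fit the estimator.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
manual_model = LogisticRegression().fit(X_scaled, y)
manual_pred = manual_model.predict(scaler.transform(X))

# Pipeline version: the same two steps run as a single object.
pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both approaches should give the same predictions.
print((manual_pred == pipe_pred).all())
```

The pipeline version is shorter and removes the risk of forgetting to apply a transformer before predicting.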
Benefits of using Scikit-learn Pipeline
- Faster model implementation through automation.
- More reliable accuracy scores, because the same preprocessing steps are applied to the training and testing data.
- Model debugging to remove errors during model training.
- Produces a more robust and scalable model.
Dataset used
We will use a telecommunication dataset that contains information about a company's customers. We will use this dataset to train a customer churn model. To download the dataset, use this link.
After downloading the dataset, we load the dataset using Pandas. To import Pandas, use this code:
import pandas as pd
We can use Pandas to load the dataset.
We will view the loaded dataset using this command:
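As a rough sketch of these two steps, assuming the dataset is a CSV file, `pd.read_csv` loads it and `df.head()` displays the first rows. The in-memory sample below stands in for the downloaded file so that the snippet runs on its own; its rows and the file name mentioned in the comment are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. In practice, pass the path to your CSV,
# e.g. pd.read_csv("customer_churn.csv") (hypothetical file name).
sample_csv = io.StringIO(
    "customerID,tenure,Contract,Churn\n"
    "0001-A,1,Month-to-month,Yes\n"
    "0002-B,34,One year,No\n"
    "0003-C,2,Month-to-month,Yes\n"
)

df = pd.read_csv(sample_csv)

# Display the first rows and the shape (rows, columns) of the loaded dataset.
print(df.head())
print(df.shape)
```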
Let us now start automated exploratory data analysis using Pandas Profiling.
Automated Exploratory Data Analysis with Pandas Profiling
To install the Pandas Profiling library, use this command:
!pip install -U pandas-profiling
We will use Pandas Profiling to generate a profile report. The report will give the dataset overview and dataset variables.
We generate the profile report using this code:
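from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Churn Data Report', explorative=True)
profile.to_notebook_iframe()
The title of the generated report will be Churn Data Report. The profile report will have the following sections: Overview, Variables, Interactions, Correlations, Missing values, and Sample.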
The overview section produces the following output:
From the generated report, the dataset has 21 variables and 7043 observations/data points. The dataset has no missing values and no duplicate rows. The image also shows the variable types, which are categorical (13), boolean (6), and numerical (2).
The variables section shows all the dataset variables. In addition, it provides useful characteristics and information about the variables.
The outputs below show some of the important variables:
customerID and gender
SeniorCitizen and Partner
Dependents and tenure
InternetService and OnlineSecurity
MonthlyCharges and TotalCharges
The interaction section has the following output:
The interaction section shows the relationship between two variables using a scatter plot. For example, the image above shows the relationship between
The correlation section shows the relationship between the dataset variables using Seaborn's heatmap. Pandas Profiling allows toggling between the four main correlation plots: Phik (φk), Kendall's τ, Spearman's ρ, and Pearson's r.
The correlations section produces the following output:
The image above shows the Phik (φk) correlation plot. We can easily toggle between the four main correlation plots to view them. By clicking the Toggle correlations descriptions button, we will view a detailed description of each correlation plot.
The missing values section shows whether there are missing values in the dataset.
The image shows the number of data points in each variable. All the variables have the same number of data points (7043). It shows there are no missing values in the dataset.
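The same checks can be cross-checked directly in Pandas. The sketch below uses a made-up miniature DataFrame (not the churn dataset) that deliberately contains one missing value and one duplicated row, so that both counts are non-zero.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with one missing value and one duplicated row.
df = pd.DataFrame({
    "tenure": [1, 34, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 56.95, np.nan],
    "Churn": ["No", "No", "No", "Yes"],
})

missing_per_column = df.isna().sum()    # count of missing values in each column
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On a dataset with no missing values and no duplicate rows, as the profile report describes for the churn data, both counts would be zero.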
The sample section displays the first 10 rows and the last 10 rows of our dataset.
First 10 rows
Last 10 rows
This marks the end of automated Exploratory Data Analysis using Pandas Profiling. The library provides a descriptive analysis of our dataset and helps us better understand the churn dataset. Let us now specify the X and y variables of our dataset.
X and y variables
The X variables represent all the independent variables in a dataset, which are the model inputs. The y variable is the dependent variable, which is the model output.
To add the X and y variables, use this code:
X = df.drop(columns=['Churn'])
y = df['Churn']
From the code above, the Churn variable is the y variable, and the remaining variables are the X variables.
Dataset splitting
Let us import the method used for dataset splitting.
from sklearn.model_selection import train_test_split
We will split the dataset into two sets using the following code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=124)
In the code above, test_size=0.30 is the splitting ratio: 70% of the dataset will be used for model training and 30% for model testing.
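The effect of the split can be checked on a small example. The DataFrame below is a hypothetical 10-row stand-in for the churn data; only the `test_size` and `random_state` values come from the tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset with 10 rows.
df = pd.DataFrame({
    "tenure": range(10),
    "Contract": ["Month-to-month", "One year"] * 5,
    "Churn": ["Yes", "No"] * 5,
})

X = df.drop(columns=["Churn"])  # model inputs
y = df["Churn"]                 # model output

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=124
)

# 70% of the rows go to training and 30% to testing.
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows land in each set on every run.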
Using Pandas Profiling, we were able to see that the dataset has three variable types. The variable types are: categorical (13), boolean (6) and numerical (2).
We need to specify the columns that belong to these variable types. We use the code below:
numeric_features = ['tenure', 'TotalCharges']
categorical_features = [
    'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
]
The code selects the columns that have categorical and numerical values.
Selecting the unused columns
We select all the unused columns using the following code:
drop_feat = ['customerID', 'gender', 'PhoneService', 'MultipleLines', 'PaperlessBilling', 'PaymentMethod']
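We will drop the selected columns from our dataset. To drop these columns, we will use one of the Scikit-learn Pipeline transformer methods. Note that PhoneService is also listed in categorical_features above; list it in only one place, depending on whether the model should use it. Let us first import all the transformer methods and classes.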
Importing transformer methods and classes
As mentioned earlier, the Scikit-learn Pipeline steps fall into two categories: transformers and estimators. The pipeline will have a sequence of transformers followed by a final estimator.
We create transformers using various Scikit-learn methods and classes which perform data transformation. Let us import all the transformer methods and classes we will use in this tutorial.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
From the code above, we have imported the following transformer classes:
- ColumnTransformer: It is a Scikit-learn class that applies the transformers to our columns. It also combines various transformers into a single transformer.
- SimpleImputer: It is a Scikit-learn class that imputes missing values. It will replace missing values in a dataset with some generated values.
- StandardScaler: It performs dataset standardization. It will ensure that our dataset values have a mean of 0 and a variance of 1. For a better understanding of the dataset standardization process, read this article.
- OneHotEncoder: It performs categorical encoding. The class converts each category into a binary (0 or 1) indicator column using a one-hot scheme. For a better understanding of OneHotEncoder, read this article.
Let us now create our first transformer using these methods.
Drop columns transformer
The first transformer will drop the unused columns.
drop_transformer = ColumnTransformer(transformers=[('drop_columns', 'drop', drop_feat)], remainder='passthrough')
The unused columns are in the drop_feat variable. Setting remainder='passthrough' keeps the remaining columns in the dataset unchanged.
We will then add the drop_transformer to the Pipeline class. However, first, let us import the Pipeline class from Scikit-learn.
Import the ‘Pipeline’ class
We import the Pipeline class as follows:
from sklearn.pipeline import Pipeline
Pipeline assembles all the initialized transformers and the final estimator. It then executes them as a single process to produce a final model.
To add the drop_transformer, use this code:
pipeline = Pipeline([('drop_column', drop_transformer)])
Next, we fit the pipeline.
This fits the pipeline to the training set. It produces the following output:
The output shows the drop_transformer added to the Pipeline class. The next step is to use the transform method to drop the unused columns.
To see the transformed dataset, use this code:
The output is shown below:
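The fit and transform calls for this step can be sketched as follows. The miniature training set and the shortened `drop_feat` list are hypothetical stand-ins; the transformer and pipeline definitions mirror the ones above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set.
X_train = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "gender": ["Female", "Male", "Male"],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})
drop_feat = ["customerID", "gender"]

drop_transformer = ColumnTransformer(
    transformers=[("drop_columns", "drop", drop_feat)],
    remainder="passthrough",  # keep every column that is not dropped
)
pipeline = Pipeline([("drop_column", drop_transformer)])

pipeline.fit(X_train)                      # record which columns to drop and which to keep
transformed = pipeline.transform(X_train)  # data without the dropped columns

print(transformed)
```

Only the `tenure` and `Contract` columns remain in the transformed output.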
Numeric Transformers
The numeric transformers will perform data imputation and standardization.
To initialize these transformers, use this code:
numeric_transformer = Pipeline(steps=[
    ('meanimputer', SimpleImputer(strategy='mean')),
    ('stdscaler', StandardScaler()),
])
From the code above, SimpleImputer will perform data imputation. strategy='mean' replaces the missing values in each column with that column's mean. The StandardScaler class performs data standardization.
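The behaviour of the numeric transformer can be illustrated on a single made-up column that contains one missing value. The input values are hypothetical; the two pipeline steps are the ones defined above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value.
values = np.array([[10.0], [20.0], [np.nan], [30.0]])

numeric_transformer = Pipeline(steps=[
    ("meanimputer", SimpleImputer(strategy="mean")),
    ("stdscaler", StandardScaler()),
])

scaled = numeric_transformer.fit_transform(values)

# The mean learned by the imputer is computed from the non-missing values only.
learned_mean = numeric_transformer.named_steps["meanimputer"].statistics_[0]
print("imputed with:", learned_mean)
print(scaled.ravel())
```

The missing entry is replaced by the column mean, so after standardization it sits at 0.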
Categorical transformer
The categorical transformer will handle the categorical values in the dataset. We will use the OneHotEncoder class to convert the categorical data into binary indicator columns.
categorical_transformer = Pipeline(steps=[
    ('onehotenc', OneHotEncoder(handle_unknown='ignore')),
])
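The following sketch shows what `handle_unknown='ignore'` does when the encoder meets a category it did not see during fitting. The contract values are illustrative, and "Prepaid" is a made-up category used only to trigger the unknown-category case.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical contract types seen during training.
train = np.array([["Month-to-month"], ["One year"], ["Two year"]])
# "Prepaid" is a made-up category that never appears in the training data.
new = np.array([["One year"], ["Prepaid"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default, so convert it for printing.
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)
```

The known category gets a single 1 in its column, while the unknown category is encoded as a row of zeros instead of raising an error.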
We then combine these initialized transformers.
Combining the initialized transformers
We use the following code:
col_transformer = ColumnTransformer(
    transformers=[
        ('drop_columns', 'drop', drop_feat),
        ('numeric_processing', numeric_transformer, numeric_features),
        ('categorical_processing', categorical_transformer, categorical_features),
    ],
    remainder='drop',
)
We have used ColumnTransformer to combine all the initialized transformers. numeric_processing transforms the numeric_features, while categorical_processing transforms the categorical_features. Because remainder='drop' is set, any column that is not listed in one of the transformers is discarded. We save the final transformer in the col_transformer variable. Let us add it to the Pipeline class.
Adding the ‘col_transformer’ transformer
To add col_transformer to the Pipeline class, use this code:
pipeline = Pipeline([('transform_column', col_transformer)])
Next, we fit the pipeline to the train set.
It produces the following output:
The image above shows all the added transformers. The next step is to use the transform method to apply the transformers to the columns.
Applying the transformers
To see the transformed dataset, use this code:
Applying the transformers to test dataset
Use this code:
To see the transformed test dataset, use this code:
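A minimal end-to-end sketch of these steps is given below. It fits the combined transformer on a training set and then applies it to both the training and testing sets. The two miniature DataFrames and the shortened column lists are hypothetical; the transformer names follow the tutorial.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature train and test sets.
X_train = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})
X_test = pd.DataFrame({
    "customerID": ["0005-E", "0006-F"],
    "tenure": [8, 22],
    "Contract": ["One year", "Two year"],
})

numeric_transformer = Pipeline(steps=[
    ("meanimputer", SimpleImputer(strategy="mean")),
    ("stdscaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("onehotenc", OneHotEncoder(handle_unknown="ignore")),
])
col_transformer = ColumnTransformer(
    transformers=[
        ("numeric_processing", numeric_transformer, ["tenure"]),
        ("categorical_processing", categorical_transformer, ["Contract"]),
    ],
    remainder="drop",  # customerID is not listed, so it is discarded
)
pipeline = Pipeline([("transform_column", col_transformer)])

pipeline.fit(X_train)                         # statistics come from the training set only
train_features = pipeline.transform(X_train)
test_features = pipeline.transform(X_test)    # reuses the fitted statistics

print(train_features.shape, test_features.shape)
```

The testing set is only transformed, never fitted, so the means, scales, and categories learned from the training set are reused on it.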
Adding the final estimator
The last step in the Scikit-learn Pipeline is to add an estimator. An estimator is the algorithm that learns from the transformed data to produce the trained model. We will use LogisticRegression as the estimator.
from sklearn.linear_model import LogisticRegression
To add the estimator to the Pipeline class, use this code:
pipeline = Pipeline([
    ('transform_column', col_transformer),
    ('logistics', LogisticRegression()),
])
From the image above, the Pipeline class has all the transformers (col_transformer) and the final estimator (LogisticRegression).
Let us fit the model to the dataset.
Fitting the pipeline
To fit the pipeline, use this code:
The pipeline will identify patterns in the training set.
Getting accuracy score on the training set
To get the accuracy score, use the following code:
The output is shown below:
It is a good accuracy score: the model classifies about 79.5% of the training samples correctly.
Let us evaluate the model using the testing set.
Getting accuracy score on the testing set
We get the accuracy score using the following code:
The output is shown below:
When we compare the two accuracy scores, the accuracy score on the testing set is better. It shows that the model still performs well on the testing set, which is new to the model, and suggests that it is not overfitting the training set.
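The fitting and scoring steps can be sketched as below. The data is synthetic and generated only for illustration, so the printed accuracies will not match the churn dataset results quoted above. The pipeline structure follows the tutorial, with `pipeline.fit` and `pipeline.score` as the standard Scikit-learn calls.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 300
tenure = rng.integers(1, 73, size=n)
contract = rng.choice(["Month-to-month", "One year", "Two year"], size=n)
# Short-tenure, month-to-month customers are made more likely to churn.
churn_prob = 0.6 - 0.006 * tenure + 0.2 * (contract == "Month-to-month")
churn = np.where(rng.random(n) < churn_prob, "Yes", "No")
df = pd.DataFrame({"tenure": tenure, "Contract": contract, "Churn": churn})

X = df.drop(columns=["Churn"])
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=124
)

numeric_transformer = Pipeline(steps=[
    ("meanimputer", SimpleImputer(strategy="mean")),
    ("stdscaler", StandardScaler()),
])
col_transformer = ColumnTransformer(transformers=[
    ("numeric_processing", numeric_transformer, ["tenure"]),
    ("categorical_processing", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
pipeline = Pipeline([
    ("transform_column", col_transformer),
    ("logistics", LogisticRegression()),
])

pipeline.fit(X_train, y_train)                      # fit the transformers, then the estimator
train_accuracy = pipeline.score(X_train, y_train)   # mean accuracy on the training set
test_accuracy = pipeline.score(X_test, y_test)      # mean accuracy on unseen data

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

For a classifier, `score` returns the fraction of correctly classified samples, which is the accuracy score discussed in this section.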
In this tutorial, we learned how to build a machine learning model using Pandas Profiling and Scikit-learn Pipeline. The tutorial explained how the Scikit-learn Pipeline works and the key pipeline steps. Pandas Profiling generated a profile report that shows the dataset overview.
After this process, we implemented our transformers using Scikit-learn transformer methods and classes. Then, we combined these transformers and added a final estimator. Finally, we trained the customer churn model using the telecommunication dataset.
The model had a good accuracy score using the training and the testing dataset. To access the complete Google Colab notebook for this tutorial, click here.
- Scikit-learn Pipeline documentation
- Pipelines and composite estimators
- Scikit-learn ColumnTransformer
- Categorical encoding
- Scikit-learn OneHotEncoder
- Dataset standardization
- Introduction to Pandas Profiling
Peer Review Contributions by: Jerim Kaura