Introduction to Scikit Learn in Python

January 25, 2021

Machine learning has been booming over the past few years, and many graduate students and industry professionals have switched careers to data science or machine learning. An essential step in getting familiar with the field is knowing your libraries and dependencies.

A significant chunk of your work goes into choosing the right approach to the problem and shaping the dataset to suit that approach. This multi-part article introduces the reader to Scikit-Learn, a vital library used to build statistical models and make predictions.

Prerequisites

The reader is expected to understand basic libraries like NumPy and Pandas, machine learning, and machine learning algorithms, including linear and logistic regression, support vector machines and decision trees, and boosting algorithms.

For a better understanding, the reader is advised to go through the following articles on Python, NumPy, Matplotlib and SciPy.

Table of contents

  1. Introduction
  2. Installing the Scikit-Learn Library
  3. Dataset Transformations using sklearn
  4. Scikit-Learn for Standardization
  5. Scikit-Learn for Normalization
  6. Scikit-Learn when Encoding Categorical Features
  7. Scikit-Learn when Filling Missing Values
  8. Conclusion
  9. Further Readings

Introduction

Scikit-Learn (often referred to as sklearn) provides a wide array of statistical models and machine learning algorithms. sklearn is written largely in Python, with performance-critical routines in Cython, and it owes much of its performance to NumPy, which it uses for high-performance linear algebra and array operations.

Scikit-Learn began as a Google Summer of Code project and has since made life easier for thousands of Python-centered data scientists worldwide. This part of the series introduces the library and focuses on one aspect - dataset transformations, a crucial step before building a prediction model.

Installing the Scikit-Learn library

Scikit-Learn itself depends on NumPy and SciPy. Matplotlib, IPython, SymPy, and Pandas are not strictly required, but they are commonly used alongside it. Let’s go ahead and install them from the terminal using pip (which works on Windows, macOS, and Linux).

pip install numpy
pip install scipy
pip install matplotlib
pip install ipython
pip install sympy
pip install pandas

Now that we’ve installed the dependent libraries, let us install Scikit-Learn.

pip install scikit-learn

Let’s check if Scikit-Learn can be accessed using Python.

import sklearn

Yes, it works!
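Beyond a bare import, it can help to confirm which version was installed. A minimal sketch, assuming the install above succeeded (the version string printed depends on your environment):

```python
import sklearn

# __version__ is a plain string; its value depends on the installed release
print("scikit-learn version:", sklearn.__version__)
```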

Dataset transformations using sklearn

An essential component for building a machine learning algorithm is data. A lot of work goes into preparing the data so that it can be fed to the model. This is called data preprocessing. Data preprocessing tasks can range from a mere change in notation to changing a continuous variable to a categorical variable.

The sklearn.preprocessing package provides various functions and classes that change the representation of certain variables so that they are better suited to the estimators further down the model pipeline. So, let’s go ahead and look at the methods Scikit-Learn offers that help in data preprocessing and transformation. Before that, let’s import the sklearn.preprocessing package.

from sklearn import preprocessing
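As a small illustration of turning a continuous variable into a categorical one, the sketch below uses Binarizer from sklearn.preprocessing. The readings and the 33.0 threshold are made-up values chosen for this example, not part of the datasets used later.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical temperature readings, one feature per row
readings = np.array([32.5, 32.9, 33.0, 33.4, 33.9]).reshape(-1, 1)

# Values strictly greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=33.0)
flags = binarizer.fit_transform(readings)

print(flags.ravel())  # [0. 0. 0. 1. 1.]
```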

Scikit-Learn for standardization

Distance-based models are machine learning algorithms that use the distance between data points to judge how similar they are. If two points are close together, one can infer that their feature values are similar, and hence the points can be classified as similar. Standardization is an essential task for distance-based models so that one feature with a large scale does not dominate the others.

A data point x is standardized as follows:

$$z = \frac{x - \mu}{\sigma}$$

Where µ is the mean of the distribution and σ is the standard deviation of the distribution. Standardization is centering around zero and scaling the data point such that the mean is 0, and the standard deviation is 1.

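Standardized values are not confined to a fixed range: each value expresses how many standard deviations the original point lies from the mean, so most values fall close to 0 but some can lie beyond -1 or 1. The reader is encouraged to go through this resource to get a better grip on why to standardize your features.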

The following are the temperatures (in degrees Fahrenheit) recorded in Bloomington, Illinois, in the month of January:

[33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]

Let us try to standardize this vector.

# Import libraries
from sklearn.preprocessing import StandardScaler
import numpy as np

# List of temperatures recorded in Bloomington
temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                    32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]

# Convert the list to a NumPy array
temperatures_np = np.array(temperatures_list).reshape(-1,1)

# Standardize the vector
temperatures_std = StandardScaler().fit_transform(temperatures_np)

# Print the means
print("Mean Before Standardizing:",sum(temperatures_list)/len(temperatures_list))
print("Mean After Standardizing:",sum(temperatures_std.reshape(1,-1)[0])/len(temperatures_std))

# Output:
# Mean Before Standardizing: 32.896774193548396
# Mean After Standardizing: -2.6215588839535955e-15

Notice that after standardizing the data, the mean is almost 0.
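The standard deviation can be checked in the same way. The sketch below reuses the January temperatures and compares StandardScaler against the formula applied by hand with NumPy. It assumes the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses. It also prints the smallest and largest standardized values to show the range they cover.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

temperatures = np.array([
    33.2, 33.1, 33.1, 33.0, 32.9, 32.9, 32.8, 32.8, 32.7, 32.7, 32.6,
    32.6, 32.6, 32.6, 32.5, 32.5, 32.5, 32.6, 32.6, 32.6, 32.7, 32.7,
    32.8, 32.9, 33.0, 33.1, 33.2, 33.4, 33.5, 33.7, 33.9,
]).reshape(-1, 1)

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(temperatures)

# Standardize by hand: subtract the mean, divide by the population std
manual = (temperatures - temperatures.mean()) / temperatures.std()

print("Matches manual formula:", np.allclose(scaled, manual))
print("Std after standardizing:", scaled.std())
print("Smallest and largest standardized values:", scaled.min(), scaled.max())
```

The last line shows that the warmest day ends up more than two standard deviations above the mean, so standardized values are not limited to any fixed interval.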

In the example above, fit_transform() is used. There are two important functions - fit() and fit_transform(). fit() computes the mean and standard deviation, which are later used for scaling along the feature axis when transform() is called. fit_transform() computes the mean and standard deviation, scales the vector, and returns a NumPy array of the scaled values. Therefore, standardization can be done either with fit() followed by transform() or in a single step with fit_transform().
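The two-step route is useful when the statistics learned from one set of values need to be reused on another. A minimal sketch, in which both arrays are small made-up samples rather than the full January data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: values to learn from, and values seen later
january = np.array([32.5, 32.6, 32.8, 33.0, 33.2, 33.9]).reshape(-1, 1)
new_readings = np.array([32.4, 33.0, 34.1]).reshape(-1, 1)

scaler = StandardScaler()
scaler.fit(january)  # learns mean_ and scale_ from january only
print("Learned mean:", scaler.mean_, "Learned std:", scaler.scale_)

# transform() applies the learned statistics without recomputing them
january_std = scaler.transform(january)
new_std = scaler.transform(new_readings)

# fit() followed by transform() gives the same result as fit_transform()
same = np.allclose(january_std, StandardScaler().fit_transform(january))
print("fit + transform equals fit_transform:", same)
```

Note that new_readings is scaled with the mean and standard deviation of january, not with its own.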

Scikit-Learn for normalization

Normalization is another feature scaling technique, used to transform the values of numeric attributes to a common scale (0 to 1). Normalization is used in cases where the values do not follow a Gaussian distribution. (Rule of thumb - standardize if the attribute can be modeled as a Gaussian distribution. If not, normalize.)

Normalization is important because it prevents the model from favoring one attribute merely because its values are on a larger scale. This resource by DataScienceDojo explains normalization with an easy-to-understand example.

A data point x is normalized as follows:

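$$x_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature.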

Let’s try to normalize the data using the same set of values used in the previous example.

#Import libraries
from sklearn.preprocessing import MinMaxScaler
import numpy as np

#List of temperatures recorded in Bloomington
temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                    32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]

#Convert the list to a NumPy array
temperatures_np = np.array(temperatures_list).reshape(-1,1)

#Normalize the vector
temperatures_norm = MinMaxScaler().fit_transform(temperatures_np)

print("Minimum Value Before Normalization:",min(temperatures_np.reshape(1,-1)[0]))
print("Maximum Value Before Normalization:",max(temperatures_np.reshape(1,-1)[0]))
print("Minimum Value After Normalization:",min(temperatures_norm))
print("Maximum Value After Normalization:",max(temperatures_norm))

# Output:
# Minimum Value Before Normalization: 32.5
# Maximum Value Before Normalization: 33.9
# Minimum Value After Normalization: [0.]
# Maximum Value After Normalization: [1.]
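To connect the scaler to the min-max formula, the sketch below computes the normalization by hand and compares it with MinMaxScaler. It also shows the feature_range parameter for scaling to a range other than 0 to 1. The five readings are a made-up subset used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical subset of temperature readings
temps = np.array([32.5, 32.6, 32.9, 33.2, 33.9]).reshape(-1, 1)

# Min-max formula applied by hand
manual = (temps - temps.min()) / (temps.max() - temps.min())

# Same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(temps)
print("Matches manual formula:", np.allclose(manual, scaled))

# feature_range changes the target interval, here -1 to 1
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(temps)
print("Custom range:", scaled_sym.min(), "to", scaled_sym.max())
```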

Scikit-Learn when encoding categorical features

Almost every dataset has one or more features that are categorical in nature. Consider a dataset containing the details of all the passengers of a certain airline. The possible categorical variables in the dataset could be the passenger’s gender (male/female) and their seating choice (economy, business, first-class). Most estimators take in only numerical data, and hence, these categorical features have to be encoded.

There are two types of encoding - label encoding and one-hot encoding.

As an example, assume a dataset of car information with the feature “Manufacturer,” and there are three car manufacturers - Ford, Hyundai, and Tata.

Label encoding would mean replacing all “Ford” with 0, all “Hyundai” with 1, and all “Tata” with 2. One-hot encoding would instead create three new binary features, one per manufacturer, with 1 indicating that the car was made by that company and 0 indicating otherwise.

from sklearn.preprocessing import LabelEncoder

bands = ["Pink Floyd","Led Zeppelin","Pink Floyd","Foo Fighters","Queen","Queen","Pink Floyd","AC/DC","Foo Fighters","Led Zeppelin","Queen",
           "Nirvana","AC/DC","The Doors","Queen","Fleetwood Mac","Nirvana"]

# Invoking an instance of Label Encoder
label_encoding = LabelEncoder()

# Fit the labels
encoded = label_encoding.fit(bands)

print(encoded.transform(bands))

# Output - [5 3 5 2 6 6 5 0 2 3 6 4 0 7 6 1 4]

Looking at the output, one can tell that the feature has been encoded, but the numbers alone do not say which band each one stands for. Luckily, the classes_ attribute helps us interpret these labels.

# Iterate through classes_ and print each class next to its encoded label
band_list = encoded.classes_

for band_number, band_name in enumerate(band_list):
    print(band_number, band_name)

# Output
# 0 AC/DC
# 1 Fleetwood Mac
# 2 Foo Fighters
# 3 Led Zeppelin
# 4 Nirvana
# 5 Pink Floyd
# 6 Queen
# 7 The Doors

Note that the classes are sorted alphabetically and numbered from 0 in that order, so the label 5 in the earlier output stands for Pink Floyd.
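Encoded labels can also be mapped back to their names with inverse_transform(). A minimal sketch using a shorter, made-up list of bands:

```python
from sklearn.preprocessing import LabelEncoder

# Shortened, hypothetical band list for illustration
bands = ["Pink Floyd", "Led Zeppelin", "Queen", "AC/DC"]

encoder = LabelEncoder().fit(bands)

# Sorted classes: AC/DC=0, Led Zeppelin=1, Pink Floyd=2, Queen=3
codes = encoder.transform(["Queen", "AC/DC"])
print(codes)  # [3 0]

# inverse_transform() recovers the original names from the codes
decoded = encoder.inverse_transform(codes)
print(decoded)
```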

If the band_list feature is one-hot encoded, each band is represented by a row of 1’s and 0’s instead of a single integer label.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

band_list = np.array(["AC/DC","Fleetwood Mac","Foo Fighters","Led Zeppelin","Nirvana","Pink Floyd","Queen","The Doors"]).reshape(-1,1)

# Create an instance of OneHotEncoder
one_hot_encoding = OneHotEncoder()

# Fit the encoder on the band names
encoded = one_hot_encoding.fit(band_list)

# transform() returns a sparse matrix; toarray() converts it to a dense array
print(encoded.transform(band_list).toarray())

# Output
# [[1. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 1.]]
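The same encoder can be applied to the car manufacturer example described earlier. The sketch below also passes handle_unknown="ignore", an option that encodes a category not seen during fitting as a row of zeros instead of raising an error. The extra manufacturer "Toyota" is a made-up unseen value added for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Manufacturers seen while fitting
seen = np.array(["Ford", "Hyundai", "Tata"]).reshape(-1, 1)

# handle_unknown="ignore" maps unseen categories to an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore").fit(seen)

# categories_ lists the learned categories, one array per feature
print(encoder.categories_)

# "Toyota" was not present during fit (hypothetical unseen value)
new_cars = np.array(["Tata", "Toyota"]).reshape(-1, 1)
rows = encoder.transform(new_cars).toarray()
print(rows)
# [[0. 0. 1.]
#  [0. 0. 0.]]
```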

Scikit-Learn when filling missing values

It is often said that around 70% of the time and resources in a project go into collecting and cleaning the dataset. Real-life datasets almost always contain missing values. Cleaning the dataset and handling missing data is important, as many machine learning algorithms do not accommodate a missing attribute in the data.

This is where Scikit-Learn’s impute module comes into play. A simple way to deal with missing values is to remove every row of data that has a missing value, but that would mean losing valuable-yet-incomplete data. A better way is to replace the missing values with values that can be inferred from known data. One way would be to replace the missing data with the mean of that column.

Missing values are encoded with NumPy’s NaN (numpy.nan).
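For comparison, the row-dropping approach can be sketched with a NumPy mask. The readings below are made-up values, and the sketch only shows how many rows are lost when incomplete rows are discarded.

```python
import numpy as np

# Hypothetical readings with two missing entries
temperatures = np.array([33.2, 32.8, np.nan, 33.0, np.nan, 32.6, 32.5]).reshape(-1, 1)

# Keep only the rows that contain no NaN
rows_with_value = ~np.isnan(temperatures).any(axis=1)
kept = temperatures[rows_with_value]

print("Rows before:", len(temperatures), "Rows after:", len(kept))
```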

The following are the temperatures (in degrees Fahrenheit) recorded in Bloomington, Illinois, in the month of February:

[33.2,32.8,32.9,33.0,nan,33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,nan,32.7,32.7,32.6,nan,32.6,32.9,32.8,32.8,32.5,32.6,nan,32.6,32.7,32.7,33.5, 33.7,33.9].

Let’s try to replace the missing temperatures with the mean of the recorded ones.

import numpy as np
from sklearn.impute import SimpleImputer

#List of temperatures
temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]

temperatures_cleaned = []

#Replace NaN's with np.nan
for temperature in temperatures:
    if temperature=="NaN":
        temperatures_cleaned.append(np.nan)
    else:
        temperatures_cleaned.append(temperature)

temperatures_np = np.array(temperatures_cleaned).reshape(-1,1)

# Create an instance of the imputer
imputer_mean = SimpleImputer(missing_values=np.nan,strategy='mean')

# Fit the imputer and transform the array according to the chosen strategy
# (the result goes into a new variable so temperatures_np keeps its NaN's)
temperatures_mean = imputer_mean.fit_transform(temperatures_np)

print(*temperatures_mean, sep=", ")

# Output - [33.2], [32.8], [32.9], [33.], [32.91111111], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.91111111], 
#          [32.7], [32.7], [32.6], [32.91111111], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.91111111], [32.6], 
#          [32.7], [32.7], [33.5], [33.7], [33.9]

SimpleImputer provides four built-in options for strategy - mean, median, most_frequent, and constant. Since mean was the chosen strategy, the nan’s were replaced with the mean of the temperatures (32.91111111).

Had most_frequent been the chosen strategy:

# Create an instance of the imputer
imputer_most_frequent = SimpleImputer(missing_values=np.nan,strategy='most_frequent')

# Fit the imputer and transform the array according to the chosen strategy
temperatures_most_frequent = imputer_most_frequent.fit_transform(temperatures_np)

print(*temperatures_most_frequent, sep=", ")

# Output - [33.2], [32.8], [32.9], [33.], [32.6], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.6], [32.7], 
#          [32.7], [32.6], [32.6], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.6], [32.6], [32.7], [32.7], [33.5], [33.7], [33.9]

… the nan’s would be replaced with the value with the most occurrences (the mode of the feature) - 32.6.

Opting for constant lets you supply the replacement value through the parameter fill_value. If fill_value is left out, numeric data is filled with 0.

# Create an instance of the imputer
imputer_constant = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=32.9)

# Fit the imputer and transform the array according to the chosen strategy
temperatures_constant = imputer_constant.fit_transform(temperatures_np)

print(*temperatures_constant, sep=", ")

# Output - [33.2], [32.8], [32.9], [33.], [32.9], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.9], [32.7], 
#          [32.7], [32.6], [32.9], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.9], [32.6], [32.7], [32.7], [33.5], [33.7], [33.9]
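The remaining strategy, median, works the same way. The sketch below uses a short made-up set of readings and also prints statistics_, the attribute in which a fitted SimpleImputer stores the fill value it computed for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical readings with two missing entries
temperatures = np.array([33.2, 32.8, np.nan, 33.0, 32.6, np.nan, 32.5]).reshape(-1, 1)

imputer_median = SimpleImputer(missing_values=np.nan, strategy="median")
filled = imputer_median.fit_transform(temperatures)

# Known values sorted: 32.5, 32.6, 32.8, 33.0, 33.2, so the median is 32.8
print("Fill value per column:", imputer_median.statistics_)
print(filled.ravel())
```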

Conclusion

This article was a brief dive into the multi-faceted world of Scikit-Learn. A good understanding of the package, along with some hands-on experience, is important in every data scientist’s journey. This article aimed to make the reader comfortable with data manipulation using sklearn and should serve as a good starting point for Scikit-Learn.

Happy Coding!

Further readings

  1. Official Docs

  2. Medium

  3. Tutorialspoint

  4. Machine Learning Mastery

  5. Data Camp


Peer Review Contributions by: Lalithnarayan C


About the author

Prashanth Saravanan

Prashanth Saravanan is an Electronics and Communication Engineering Undergrad at Amrita Vishwa Vidyapeetham, India. He is a passionate data scientist and loves technology. He’s an avid Tableau developer who designs interactive dashboards, often based on The Office. You can almost always catch him with Pink Floyd on his earphones, collecting vinyls or learning the bass.

This article was contributed by a student member of Section's Engineering Education Program. Please report any errors or inaccuracies to enged@section.io.