
Applying AI and Machine Learning to Predict Consumer Behavior

July 14, 2021

In this article, we will look at general consumer behavior and at how Artificial Intelligence helps uncover insights that let companies make better decisions, deliver more value to customers, and generate more revenue.

We will also work through a case study that uses data science and analytics to uncover such insights.


As a prerequisite, the reader should have a basic understanding of Python and machine learning.

What is Artificial Intelligence?

Artificial Intelligence is the ability of a machine to learn from data and perform tasks that normally require human intelligence.

Advancements in AI have led to improvements across several industries, such as automation, supply chain, eCommerce, and manufacturing.

Beyond that, fields closely related to AI, namely Data Science and Machine Learning, have enabled businesses to make better decisions. For example, to improve the revenue of an eCommerce store, we could analyze customer data and provide personalized recommendations based on likes and dislikes, most frequently purchased items, previous searches, and correlations between item purchases.

AI has played a significant role in eCommerce by planning inventory and logistics, finding trends and patterns, predicting future outcomes from historical data, and informing fact-based decisions.

Understanding consumer behavior

Consumer behavior, in its broadest sense, is concerned with how consumers select, decide on, use, and dispose of goods and services. It covers individuals, groups, and organizations in any vertical.

It gives insight into consumers’ emotions, attitudes, and preferences, all of which affect buying behavior. This helps marketers understand the needs of customers, bring value to them, and in return generate revenue for the company.

Predicting the consumer behavior

Big companies understand that predicting customer behavior helps find gaps in the market and identify products that are needed and could generate more revenue.

Consumer behavior prediction can be done by:

  1. Segmentation: separating customers into smaller groups based on buying behavior. Treating each group separately helps us identify which part of the market it represents.
  2. Predictive Analytics: using statistical techniques on historical data to predict the future behavior of customers.
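
The two ideas above can be sketched on a tiny table. This is only an illustration: the column names, the values, and the age bin edges are all made up and do not come from the dataset used later in the article. Customers are first bucketed into segments, and the historical average of each segment then serves as a very simple baseline prediction for a new customer in that segment.

```python
import pandas as pd

# Hypothetical purchase history; the values are made up for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age": [19, 23, 35, 41, 52, 67],
    "monthly_spend": [120.0, 80.0, 200.0, 150.0, 60.0, 40.0],
})

# Segmentation: bucket customers into age groups (bin edges are an assumption).
customers["segment"] = pd.cut(
    customers["age"],
    bins=[0, 30, 50, 120],
    labels=["under_30", "30_to_50", "over_50"],
)

# Baseline prediction: a new customer is expected to spend roughly
# the historical average of their segment.
segment_mean = customers.groupby("segment", observed=True)["monthly_spend"].mean()
print(segment_mean)
```

Real predictive analytics would replace the per-segment average with a fitted statistical or machine learning model, but the flow (group, summarize history, predict) stays the same.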

Step by step implementation

Now, let’s understand how this is done, using a real-world example.

Understanding dataset

In this dataset, we have information related to customers like:

  • CustomerID - ID of the customer
  • Gender - Gender of the customer
  • Age - Age of the customer
  • AnnualIncome - annual income of the customer
  • SpendingScore - score assigned based on the customer’s behavior and their purchasing data

You can download the dataset here.


The objective of this tutorial is to understand the behavior of customers based on their purchasing data. This helps the marketing team plan new strategies accordingly.

Importing libraries

For data exploration, a few Python libraries need to be installed.

You can download Python using this link.

The libraries to import are:

import numpy as np
import pandas as pd
import sklearn
from sklearn.cluster import KMeans # clustering algorithm used later in the tutorial
import matplotlib.pyplot as plt
import seaborn as sns

View dataset

Before we start, let’s have a look at the dataset. To view it, we first read the CSV file into a dataframe and then display the first rows, as shown below:

df = pd.read_csv(r'../input/Mall_Customers.csv')
df.head() # shows the first 5 rows

First 5 rows of the dataset
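
Beyond the first rows, a few more pandas calls give a quick picture of a dataset. The sketch below runs on a small hand-made table so that it works without the CSV file; the column names follow the ones used in this article’s code, while the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the article's code.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 35, 52],
    "Annual Income (k$)": [15, 16, 40, 75],
    "Spending Score (1-100)": [39, 81, 60, 20],
})

print(sample.head())          # first rows of the table
print(sample.dtypes)          # data type of each column
print(sample.isnull().sum())  # number of missing values per column
print(sample.describe())      # summary statistics for the numeric columns
```

The same calls can be applied to `df` once the real CSV file has been loaded.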

Data visualizations

Correlation between Age, Income and Spending scores

A good starting point for a marketing strategy is to analyze spending patterns. Here, let’s look at how the age, annual income, and spending score of the customers are distributed.

plt.figure(1 , figsize = (15 , 6)) # sets the dimensions of image
n = 0 
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1 , 3 , n) # creates 3 different sub-plots
    plt.subplots_adjust(hspace =0.5 , wspace = 0.5)
    sns.histplot(df[x] , bins = 20 , kde = True) # histogram with a density curve (replaces the deprecated distplot)
    plt.title('Distribution of {}'.format(x)) # sets title for each plot
plt.show() # displays all the plots


Distribution plots of Age, Annual Income, and Spending scores

Gender analysis

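The second most important thing in deciding the strategy is to analyze the customers by Gender. Here, the count plot shows that female customers outnumber male customers in this dataset. Note that it counts customers, not how much they spend.

plt.figure(1 , figsize = (15 , 5))
sns.countplot(y = 'Gender' , data = df) # one bar per gender, showing the number of customers
plt.show() # display the plot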


Count plot describing the Males’ and Females’ spending patterns
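
A count plot only shows how many customers of each gender there are. To compare spending itself, one option is to group by gender and aggregate the spending score. The sketch below uses a small hypothetical table with the same column names as the article’s code; the numbers are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows, used only to show the grouping pattern.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Spending Score (1-100)": [40, 80, 60, 20, 70],
})

# Number of customers and average spending score per gender.
by_gender = sample.groupby("Gender")["Spending Score (1-100)"].agg(["count", "mean"])
print(by_gender)
```

Running the same `groupby` on `df` would show whether the difference in counts also appears as a difference in average spending score.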

Customer segmentation

Segmentation helps in dividing a set of large data into groups of smaller observations that are similar in specific ways relevant to marketing.

Each group contains individuals who are similar to one another and different from individuals in the other groups.

Segmentation is widely used as a marketing tool to create clusters of clients and adapt a relevant strategy for each of them.

Here, we will learn to segment this data based on several factors and understand how it helps in improving the existing strategy.

Segmentation using Age and Spending score

Let’s try to segment the customers based on their age and spending score. This helps us identify the age groups where spending scores could be improved, thereby increasing revenue for the company.

Here, we have to decide the number of clusters (segments) that gives the best results. To do that, we loop over cluster counts from 1 to 10, record the inertia of each fitted model, and then pick the most suitable count.

X_age_spending = df[['Age' , 'Spending Score (1-100)']].iloc[: , :].values # extracts only age and spending score information from the dataframe
inertia = []
for n in range(1 , 11):
    model_1 = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 , max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan')) # use predefined Kmeans algorithm
    model_1.fit(X_age_spending) # fit the data into the model
    inertia.append(model_1.inertia_) # store the within-cluster sum of squared distances for this n

To read more about the KMeans algorithm, refer to this documentation. To understand how the algorithm works, refer here.

Let’s visualize this via a graph:

plt.figure(1 , figsize = (15 ,6)) # set dimension of image
plt.plot(np.arange(1 , 11) , inertia , 'o') # Mark the points with a solid circle
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5) # connect remaining points with a line
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia') # label the x and y axes
plt.show() # display

Line graph displaying clusters

As you may notice, the curve starts to flatten after 4 clusters, so adding more clusters reduces the inertia only slightly. Choosing the number of clusters at this bend is known as the Elbow method.
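
Reading the bend of a curve by eye can be subjective. A complementary check, which is not part of the original tutorial, is the silhouette score from scikit-learn: it measures how well each point fits its own cluster compared with the nearest other cluster, and higher is better. The sketch below uses synthetic data generated with `make_blobs` as a stand-in for the age and spending score pairs; the cluster centers and spread are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (Age, Spending Score) pairs: four well-separated groups.
centers = np.array([[25, 80], [25, 20], [60, 80], [60, 20]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=3.0, random_state=111)

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=111).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the two methods do not always agree, so it is reasonable to treat both as hints rather than as a definitive answer.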

Now, let’s explore further using 4 clusters.

model_2 = (KMeans(n_clusters = 4 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') ) # set number of clusters as 4
model_2.fit(X_age_spending) # fit the model
labels1 = model_2.labels_
centroids1 = model_2.cluster_centers_

Let’s visualize them now:

Before that, there are some prerequisites for plotting the graph, such as setting the minimum and maximum ranges of values and initializing a meshgrid.

h = 0.02
x_min, x_max = X_age_spending[:, 0].min() - 1, X_age_spending[:, 0].max() + 1
y_min, y_max = X_age_spending[:, 1].min() - 1, X_age_spending[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model_2.predict(np.c_[xx.ravel(), yy.ravel()])  # returns flattened 1D array

You can read more about Meshgrids here.
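
To see what the meshgrid step produces, here is a minimal sketch on a deliberately tiny grid (the sizes are chosen only for readability). `np.meshgrid` expands two 1D axes into two 2D coordinate arrays, and `np.c_` pairs the flattened arrays into one row per grid point, which is the shape that `predict` expects.

```python
import numpy as np

xs = np.arange(0, 3)  # 3 x-values: 0, 1, 2
ys = np.arange(0, 2)  # 2 y-values: 0, 1
xx, yy = np.meshgrid(xs, ys)

# Both arrays have one entry per grid point: 2 rows (y) by 3 columns (x).
print(xx.shape, yy.shape)

# Flatten and pair up the coordinates: one (x, y) row per grid point.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)
```

In the tutorial’s code the grid is much finer (step `h = 0.02`), so the model labels every small cell of the plot area, and reshaping the predictions back to `xx.shape` gives the colored background regions.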

Now, let’s plot the graph:

plt.figure(1 , figsize = (15 , 7) )
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Age' ,y = 'Spending Score (1-100)' , data = df , c = labels1 , 
            s = 200 )
plt.scatter(x = centroids1[: , 0] , y =  centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')


KMeans with 4 clusters

From the above plot, we can infer several things about the spending patterns:

  • The average spending score irrespective of age would be around 20
  • In the topmost cluster, customers below age 40 have the highest spending scores. This cluster is relatively dense.
  • Above age 40, the spending score remains consistently within the range of 30 - 60.

More insights could be extracted from this data with deeper analysis, by correlating it with all parameters that are directly or indirectly related.
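
One simple way to back up observations read from a plot is to summarize each cluster numerically. The sketch below is an illustration on synthetic data, not on the article’s dataset: `make_blobs` generates stand-in age and spending score pairs, and the centers and spread are assumptions. It attaches the cluster label to each row and then reports the size and the average age and score of each cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (Age, Spending Score) columns.
centers = np.array([[25, 80], [25, 20], [60, 80], [60, 20]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=3.0, random_state=111)
data = pd.DataFrame(X, columns=["Age", "Spending Score (1-100)"])

# Fit KMeans with 4 clusters and attach the label to each row.
model = KMeans(n_clusters=4, n_init=10, random_state=111).fit(X)
data["Cluster"] = model.labels_

# Size, average age, and average spending score of each cluster.
summary = data.groupby("Cluster").agg(
    customers=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_score=("Spending Score (1-100)", "mean"),
)
print(summary.round(1))
```

With the tutorial’s own variables, the same summary could be obtained by adding `labels1` as a column of `df` and grouping on it.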


As this simple case study shows, AI can turn customer data into insights that support business decisions, and it has played a significant role across many industries. With the rising trend of data analysis, customer behavior is continuously monitored to improve strategies and make better decisions.

This article acts only as a guide for beginners, to get them started in this field.



Peer Review Contributions by: Srishilesh P S