Getting Started with Polynomial Regression in R
July 30, 2021
Polynomial regression is used when there is a non-linear relationship between dependent and independent variables. Examples of cases where polynomial regression can be used include modeling population growth, the spread of diseases, and epidemics.
Such trends are usually regarded as non-linear.
The general form of a polynomial regression model is:
$$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \varepsilon$$
For example, a polynomial model of degree 2 can be written as:
$$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$$
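As a small illustration of the degree-2 form, the sketch below evaluates it for a few values of X. The coefficients b0, b1, and b2 are made-up numbers chosen only for this example, and the error term is left out.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- 1
b1 <- 2
b2 <- 0.5

# Evaluate y = b0 + b1 * X + b2 * X^2 (the error term is omitted)
quadratic <- function(x) b0 + b1 * x + b2 * x^2

x <- c(0, 1, 2, 3)
y <- quadratic(x)
print(data.frame(x = x, y = y))
```

Because of the squared term, equal steps in X produce growing steps in y, which is the curvature a straight line cannot capture.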
Now that we know what Polynomial Regression is, let’s use this concept to create a prediction model.
A general understanding of R and the Linear Regression Model will be helpful for the reader to follow along.
Step 1 - Data preprocessing
The dataset used in this article can be found here.
The first step is to import the dataset, as shown below:
dataset = read.csv('salaries.csv')
This is what our dataset should look like:
In the dataset above, we do not need column 1 since it only contains the names of each entry. To remove column 1 from our dataset, we simply run the following code:
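One way to drop the first column is negative indexing, which removes a column by position. The sketch below uses a small made-up data frame in place of salaries.csv; the column names Position, Level, and Salary are assumptions about the file's layout.

```r
# Made-up stand-in for the imported data; column names are assumed
dataset <- data.frame(
  Position = c("Analyst", "Manager", "Director"),
  Level    = c(1, 2, 3),
  Salary   = c(45000, 60000, 90000)
)

# Drop column 1 (the names) and keep the remaining columns
dataset <- dataset[-1]
print(names(dataset))
```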
Our dataset should now look like this:
To determine whether a polynomial model is suitable for our dataset, we make a scatter plot and observe the relationship between salary (dependent variable) and level (independent variable).
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red')
Our scatter plot should look as shown below:
From the plot above, it's clear that the salary and level variables have a non-linear relationship. Therefore, a polynomial regression model is suitable.
The second step in data preprocessing usually involves splitting the data into the training set and the test set. In our case, we will not carry out this step since we are using a simple dataset.
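We also do not need to apply feature scaling by hand: the lm function fits by ordinary least squares, and its predictions do not depend on the scale of the predictors.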
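To see why manual feature scaling is not needed with lm, the following sketch fits the same least-squares model twice, once on a raw predictor and once on a rescaled copy, and compares the fitted values. The data are synthetic, and the names toy, Level_scaled, fit_raw, and fit_scaled are invented for this illustration.

```r
# Synthetic data: quadratic trend plus a deterministic wiggle
# (not the article's dataset)
toy <- data.frame(Level = 1:10)
toy$Salary <- 1000 + 500 * toy$Level + 200 * toy$Level^2 + 300 * sin(toy$Level)

# Same predictor, expressed on a different scale
toy$Level_scaled <- toy$Level / 10

fit_raw    <- lm(Salary ~ Level + I(Level^2), data = toy)
fit_scaled <- lm(Salary ~ Level_scaled + I(Level_scaled^2), data = toy)

# The coefficients differ, but the fitted values agree
same_fit <- all.equal(unname(fitted(fit_raw)), unname(fitted(fit_scaled)))
print(same_fit)
```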
Step 2 - Fitting the polynomial regression model
The polynomial regression model is an extension of the linear regression model. The only difference is that we add polynomial terms of the independent variable (level) to the dataset to form our matrix.
This is demonstrated below:
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
Our new dataset will look like this:
As stated, to fit the polynomial model, we use the lm function, as highlighted below:
poly_reg = lm(formula = Salary ~ ., data = dataset)
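The same kind of model can also be written without adding columns by hand, using R's poly() function inside the formula. The sketch below checks on synthetic data that both approaches give the same fitted values; the data frame toy and the model names fit_manual and fit_poly are invented for this example.

```r
# Synthetic data (not the article's dataset)
toy <- data.frame(Level = 1:10)
toy$Salary <- 1000 + 500 * toy$Level + 200 * toy$Level^2 + 300 * sin(toy$Level)

# Approach 1: add the polynomial terms as explicit columns
manual <- toy
manual$Level2 <- manual$Level^2
manual$Level3 <- manual$Level^3
manual$Level4 <- manual$Level^4
fit_manual <- lm(Salary ~ ., data = manual)

# Approach 2: let poly() build the raw polynomial terms
fit_poly <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = toy)

same_fit <- all.equal(unname(fitted(fit_manual)), unname(fitted(fit_poly)))
print(same_fit)
```

With poly(), predictions only need a Level column in the new data, since the powers are computed inside the formula.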
After completing the polynomial model, we use the following code to evaluate its effectiveness:
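A common way to evaluate a fitted lm model is summary(), which reports the coefficients together with the R-squared and adjusted R-squared. The sketch below uses synthetic data in place of the article's dataset, so its numbers will not match the article's results.

```r
# Synthetic data (not the article's dataset)
dataset <- data.frame(Level = 1:10)
dataset$Salary <- 1000 + 500 * dataset$Level + 200 * dataset$Level^2 +
  300 * sin(dataset$Level)
dataset$Level2 <- dataset$Level^2
dataset$Level3 <- dataset$Level^3
dataset$Level4 <- dataset$Level^4

poly_reg <- lm(formula = Salary ~ ., data = dataset)

# Full report: coefficients, residual standard error, R-squared
model_summary <- summary(poly_reg)
print(model_summary)

# The two goodness-of-fit numbers can also be read directly
print(model_summary$r.squared)
print(model_summary$adj.r.squared)
```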
From the results above, the model fits the training data well, with a reported goodness of fit of 99.53%.
Step 3 - Visualizing the model
We use the ggplot2 library to visualize our model, as demonstrated below:
library(ggplot2)
# Fine grid of levels so the fitted curve is drawn smoothly
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = x_grid,
                y = predict(poly_reg,
                            newdata = data.frame(Level = x_grid,
                                                 Level2 = x_grid^2,
                                                 Level3 = x_grid^3,
                                                 Level4 = x_grid^4))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')
Below are the results obtained from this analysis:
From the graph above, we can see that the fitted curve passes close to every data point, so the model fits the data well. Therefore, we can use the model to make other predictions.
Step 4 - Making predictions using the polynomial regression model
Now that we have developed the model, it’s time to make some predictions.
Suppose you would like to predict the salary of an employee whose level is 7.5. To do this, we use the predict() function, as highlighted below.
# Predicting a new result with the polynomial regression
predict(poly_reg, data.frame(Level = 7.5, Level2 = 7.5^2, Level3 = 7.5^3, Level4 = 7.5^4))
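Typing each power by hand is easy to get wrong, so the new-data row can also be built by a small helper. The function predict_salary below is a hypothetical convenience wrapper, not part of the original code, and it is demonstrated on synthetic data rather than the article's dataset.

```r
# Synthetic data (not the article's dataset): a noiseless quadratic trend
dataset <- data.frame(Level = 1:10)
dataset$Salary <- 1000 + 500 * dataset$Level + 200 * dataset$Level^2
dataset$Level2 <- dataset$Level^2
dataset$Level3 <- dataset$Level^3
dataset$Level4 <- dataset$Level^4
poly_reg <- lm(formula = Salary ~ ., data = dataset)

# Hypothetical helper: builds Level, Level2, ..., Level<degree> and predicts
predict_salary <- function(model, level, degree = 4) {
  new_data <- data.frame(Level = level)
  for (d in seq_len(degree)[-1]) {
    new_data[[paste0("Level", d)]] <- level^d
  }
  unname(predict(model, newdata = new_data))
}

print(predict_salary(poly_reg, c(3.7, 7.5)))
```

The helper accepts a vector of levels, so several predictions can be requested in one call.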
The salary of an employee with a level of 3.7 is calculated as shown below:
predict(poly_reg, data.frame(Level = 3.7, Level2 = 3.7^2, Level3 = 3.7^3, Level4 = 3.7^4))
The result is:
The next step is to examine the effect of additional degrees on our polynomial model:
dataset$Level5 = dataset$Level^5
Let's build a new model with a Level5 column added and then examine its effects:
poly_reg2 = lm(formula = Salary ~ ., data = dataset)
predict(poly_reg2, data.frame(Level = 7.5, Level2 = 7.5^2, Level3 = 7.5^3, Level4 = 7.5^4, Level5 = 7.5^5))
The employee's salary is predicted to be 237446 as compared to the 225123.3 we had obtained from the model with 4 degrees.
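Generally, the more degrees the polynomial regression model has, the more closely it fits the data it was trained on. A degree that is too high can overfit, however, and then predictions for new data become less reliable.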
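One way to check whether an extra degree actually helps is to compare models on observations they were not fitted on. The sketch below does this with leave-one-out cross-validation on synthetic data; the function loo_rmse and the data frame toy are invented for this illustration, and the outcome on the article's dataset may differ.

```r
# Synthetic data: quadratic trend plus a deterministic wiggle
# (not the article's dataset)
toy <- data.frame(Level = 1:10)
toy$Salary <- 1000 + 500 * toy$Level + 200 * toy$Level^2 + 300 * sin(toy$Level)

# Hypothetical helper: leave-one-out RMSE for a polynomial of a given degree
loo_rmse <- function(data, degree) {
  errors <- sapply(seq_len(nrow(data)), function(i) {
    # Fit on every row except row i, then predict the held-out row
    fit <- lm(Salary ~ poly(Level, degree, raw = TRUE), data = data[-i, ])
    data$Salary[i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  sqrt(mean(errors^2))
}

# Compare held-out error across degrees 1 to 5
degrees <- 1:5
results <- data.frame(
  degree   = degrees,
  loo_rmse = sapply(degrees, function(d) loo_rmse(toy, d))
)
print(results)
```

A degree at which the held-out error stops falling, or starts rising again, is a sign that further terms are fitting noise rather than the trend.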
From this article, you have learned how to analyze data using polynomial regression models in R. You can use this knowledge to build accurate models to predict disease occurrence, epidemics, and population growth.
Peer Review Contributions by: Wanja Mike