EngEd Community

Section’s Engineering Education (EngEd) Program fosters a community of university students in Computer Science related fields of study to research and share topics that are relevant to engineers in the modern technology landscape. You can find more information and program guidelines in the GitHub repository. If you're currently enrolled in a Computer Science related field of study and are interested in participating in the program, please complete this form.

Getting Started with Polynomial Regression in R

July 30, 2021

Polynomial regression is used when there is a non-linear relationship between the dependent and independent variables. Examples of cases where polynomial regression can be used include modeling population growth and the spread of epidemic diseases.

Such trends are usually regarded as non-linear.

The general form of a polynomial regression model is:

y = β₀ + β₁X + β₂X² + … + βₙXⁿ + ε

For example, a polynomial model of 2 degrees can be written as:

y = β₀ + β₁X + β₂X² + ε
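To make the connection to ordinary linear regression concrete, here is a minimal sketch (on synthetic data, not the salary dataset used later in this article) that fits a degree-2 polynomial with lm:

```r
# Synthetic data following y = 2 + 3x + 0.5x^2 plus a little noise
set.seed(42)
x <- seq(0, 10, by = 0.5)
y <- 2 + 3 * x + 0.5 * x^2 + rnorm(length(x), sd = 0.5)

# A polynomial model is still linear in its coefficients,
# so lm can fit it once we supply the squared term with I()
fit <- lm(y ~ x + I(x^2))
coef(fit)  # estimates should be close to 2, 3 and 0.5
```

The key point is that the "polynomial" part lives entirely in the predictors; the model remains linear in β₀, β₁, β₂, which is why lm can estimate it.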

Now that we know what polynomial regression is, let’s use this concept to create a prediction model.

Prerequisites

A general understanding of R and the Linear Regression Model will be helpful for the reader to follow along.

Step 1 - Data preprocessing

The dataset used in this article can be found here.

The first step we need to do is to import the dataset, as shown below:

dataset = read.csv('salaries.csv')  

This is what our dataset should look like:

Dataset output

In the dataset above, we do not need column 1 since it only contains the names of each entry.

To remove column 1 from our dataset, we simply run the following code:

dataset = dataset[2:3]
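Bracket subsetting like dataset[2:3] keeps only the listed columns. A small self-contained illustration, with a made-up data frame standing in for salaries.csv:

```r
# A toy stand-in for the salary data
df <- data.frame(Position = c("Analyst", "Manager", "Partner"),
                 Level    = c(1, 2, 3),
                 Salary   = c(45000, 80000, 150000))

df <- df[2:3]   # keep columns 2 and 3 (Level and Salary), drop Position
names(df)       # "Level" "Salary"
```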

Our dataset should now look like this:

Newdataset output

To determine whether a polynomial model is suitable for our dataset, we make a scatter plot and observe the relationship between salary (dependent variable) and level (independent variable).

library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red')

Our scatter plot should look as shown below:

Plot of Salary against levels

From the analysis above, it’s clear that salary and level variables have a non-linear relationship. Therefore, a polynomial regression model is suitable.

The second step in data preprocessing usually involves splitting the data into a training set and a test set. In our case, we will not carry out this step since we are using a small and simple dataset.

We also do not need to perform feature scaling, since the lm function works directly with the predictors on their original scale.
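For reference, a train/test split would typically look something like the sketch below (an illustration using base R's sample; the 80/20 ratio and the toy data frame are assumptions, not part of this article's pipeline):

```r
set.seed(123)
# Toy data frame standing in for the full dataset
df <- data.frame(Level = 1:10, Salary = (1:10)^3 * 1000)

# Put roughly 80% of the rows in the training set
train_idx    <- sample(nrow(df), size = 0.8 * nrow(df))
training_set <- df[train_idx, ]
test_set     <- df[-train_idx, ]

nrow(training_set)  # 8
nrow(test_set)      # 2
```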

Step 2 - Fitting the polynomial regression model

The polynomial regression model is an extension of the linear regression model. The only difference is that we add polynomial terms of the independent variable (Level) to the dataset, forming the design matrix.

This is demonstrated below:

dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
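Adding the columns by hand works, but R can also generate the powers inside the formula itself. A sketch of an equivalent degree-4 fit using poly() with raw = TRUE (the toy data frame below is made up for illustration):

```r
# Toy data frame standing in for the salary dataset
toy_data <- data.frame(Level  = 1:10,
                       Salary = c(45, 50, 60, 80, 110, 150,
                                  200, 300, 500, 1000) * 1000)

# poly(Level, 4, raw = TRUE) expands Level into its first four powers,
# so no Level2..Level4 columns need to be added by hand
poly_reg_alt <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = toy_data)
length(coef(poly_reg_alt))  # 5: intercept plus four polynomial coefficients
```

The explicit-column approach in this article makes predict() calls more verbose (each power must be supplied), while the formula approach lets predict() accept a plain Level value.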

Our new dataset will look like this:

Newdataset added levels

As stated, to fit the polynomial model, we use the lm function, as highlighted below:

poly_reg = lm(formula = Salary ~ ., data = dataset)

After fitting the polynomial model, we use the following code to evaluate its effectiveness:

summary(poly_reg)

polynomial regression summary results

From the results above, the model fits the data well: the R-squared value is about 0.9953, meaning the model explains roughly 99.53% of the variance in salary.

Step 3 - Visualizing the model

We use the ggplot2 library to visualize our model, as demonstrated below:

library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(poly_reg,
                                        newdata = data.frame(Level = x_grid,
                                                             Level2 = x_grid^2,
                                                             Level3 = x_grid^3,
                                                             Level4 = x_grid^4))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')

Below are the results obtained from this analysis:

visualization of the polynomial regression

From the graph above, we can see that the model is nearly perfect. It fits the data points appropriately. Therefore, we can use the model to make other predictions.

Step 4 - Making predictions using the polynomial regression model

Now that we have developed the model, it’s time to make some predictions.

Suppose you would like to predict the salary of an employee whose level is 7.5. To do this, we use the predict() function, as highlighted below.

# Predicting a new result with the polynomial regression
predict(poly_reg, data.frame(Level = 7.5,
                             Level2 = 7.5^2,
                             Level3 = 7.5^3,
                             Level4 = 7.5^4))

Output:

225126.3
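Because the model expects the Level2..Level4 columns explicitly, every predict() call must rebuild them. A small helper function (hypothetical, not part of the article's code) can construct them and reduce the chance of a typo:

```r
# Hypothetical helper: build the predictor data frame for a given level,
# matching the Level..Level4 columns the fitted model expects
make_levels <- function(level) {
  data.frame(Level  = level,
             Level2 = level^2,
             Level3 = level^3,
             Level4 = level^4)
}

make_levels(7.5)
# The prediction above would then read: predict(poly_reg, make_levels(7.5))
```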

The salary of an employee with a level of 3.7 is calculated, as shown below:

predict(poly_reg, data.frame(Level = 3.7,
                             Level2 = 3.7^2,
                             Level3 = 3.7^3,
                             Level4 = 3.7^4))

The result is:

84363.82 

The next step is to examine the effect of additional degrees on our polynomial model:

dataset$Level5 = dataset$Level^5

Let’s build a new model with a Level5 column added and then examine its effects:

poly_reg2 = lm(formula = Salary ~ ., data = dataset)

predict(poly_reg2, data.frame(Level = 7.5,
                              Level2 = 7.5^2,
                              Level3 = 7.5^3,
                              Level4 = 7.5^4,
                              Level5 = 7.5^5))

The employee’s salary is now predicted to be 237446, compared to the 225126.3 we had obtained from the degree-4 model.

Adding degrees lets the polynomial follow the training data more closely, but a degree that is too high can overfit, so extra terms should be justified by performance on data the model has not seen.
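A synthetic sketch of this trade-off: on the training data itself, a higher-degree fit is always at least as close, even when the extra terms are only chasing noise (the data below is made up; the true relationship is quadratic):

```r
set.seed(1)
x <- seq(1, 10, by = 0.5)
y <- 3 + 2 * x^2 + rnorm(length(x), sd = 10)  # quadratic truth plus noise
train <- data.frame(x = x, y = y)

fit2 <- lm(y ~ poly(x, 2, raw = TRUE), data = train)
fit8 <- lm(y ~ poly(x, 8, raw = TRUE), data = train)

# Training RMSE: the degree-8 model is guaranteed to be <= the degree-2 model,
# because the degree-2 model is nested inside it
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
rmse(fit2)
rmse(fit8)

# The extra wiggles of the degree-8 fit are mostly fitting noise, which is
# why held-out data (or adjusted R-squared) is the honest way to pick a degree
```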

Conclusion

From this article, you have learned how to analyze data using polynomial regression models in R. You can use this knowledge to build accurate models to predict disease occurrence, epidemics, and population growth.

Happy coding!


Peer Review Contributions by: Wanja Mike