# Building an Ensemble Learning Based Regression Model using Python

##### October 19, 2021

Machine learning models are evaluated on their performance using metrics such as accuracy, precision, and Mean Squared Error (MSE). Each type of machine learning problem has its own evaluation metrics.

A high-performance model is therefore one that scores well on its evaluation metric, which for regression means a low error. In this tutorial, we will build a performance-driven regression model using ensemble learning.

### Prerequisites

To follow through the tutorial, you’ll need:

1. To know the basics of Python.
2. To have a Kaggle account.
3. To know the basics of Machine Learning.

### Introduction

Linear regression is a statistical method of modeling the relationship between independent variables (x) and dependent variables (y). It uses independent variables (features) to predict dependent variables (target).

Ensemble learning is a machine learning technique that seeks to achieve a better predictive model performance by combining decisions from different models.

For our model’s evaluation, we will be using RMSE (Root Mean Squared Error).

NB: Regression problems cannot be evaluated with the accuracy metric, because the goal is to measure how close the predicted values are to the expected values, not whether each prediction is exactly correct. Hence, we use error metrics to evaluate our models.
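As a small illustration of the metric, the sketch below computes RMSE by hand with NumPy. The helper name `rmse` and the sample values are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

# Errors are 1, 0, -2, 0, so the mean squared error is 1.25
print(rmse(y_true, y_pred))
```

A lower RMSE means the predictions sit closer to the true values, and a perfect model would score 0.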

Before building our model, we will first go to Kaggle and create a new notebook and rename it to Create_Folds.

### Creating k-folds

Once done with setting up the environment, we will move on to creating k-folds for our dataset.

Cross-validation is a technique for evaluating machine learning models on a finite dataset. It is popular because it is easy to understand and usually gives a less biased estimate of model performance than a single train/test split.

It is also good practice to create the folds once, at the start of a machine learning problem, and reuse them throughout the modeling process.
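To make the idea concrete, here is a minimal sketch of 5-fold cross-validation on synthetic data. It uses scikit-learn's `LinearRegression` and `cross_val_score` purely for illustration; the data, coefficients, and variable names are assumptions and do not come from the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: 100 rows, 3 features, small Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Each of the 5 folds is used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_root_mean_squared_error"
)

# scikit-learn returns negated errors, so flip the sign to get RMSE per fold
fold_rmse = -scores
print(fold_rmse, fold_rmse.mean())
```

Averaging the per-fold RMSE gives a single score that depends less on how any one split happened to fall.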

#### Importing the necessary libraries

Before proceeding, we need to import the following necessary libraries:

``````
import numpy as np
import pandas as pd
from sklearn import model_selection
``````

We will now load our dataset into the notebook. Since the data is stored in CSV files, we will use the pandas `read_csv()` function to read it.

The code is as shown below:

``````
train_data = pd.read_csv('/kaggle/input/Dataset/train.csv')
``````

#### Creating the folds

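As shown below, let’s add a new column named `kfold` as the last column of the DataFrame. We initialize it to `-1`, a placeholder that is overwritten with the fold number in the next step.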

``````
train_data['kfold'] = -1
``````

We will then proceed to create 5 folds using the following code block:

``````
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

# Label each row with the fold in which it serves as validation data
for fold, (train_indices, valid_indices) in enumerate(kf.split(X=train_data)):
    train_data.loc[valid_indices, "kfold"] = fold
``````
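A quick sanity check after assigning folds is to count the rows in each one. The sketch below repeats the fold assignment on a tiny made-up DataFrame (the `toy` name and its values are assumptions) and confirms that every row received a fold and that the folds are the same size.

```python
import pandas as pd
from sklearn import model_selection

# Tiny stand-in for the training data: 10 rows
toy = pd.DataFrame({"id": range(10), "target": [float(i) for i in range(10)]})
toy["kfold"] = -1

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_indices) in enumerate(kf.split(X=toy)):
    toy.loc[valid_indices, "kfold"] = fold

# No row should still hold the -1 placeholder
print((toy["kfold"] >= 0).all())

# Rows per fold, ordered by fold number
fold_sizes = toy["kfold"].value_counts().sort_index()
print(fold_sizes)
```

On the real dataset, the same `value_counts()` call on `train_data["kfold"]` should show five folds of nearly equal size.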

After running the cell above, we will output the new csv file (`train_kfolds.csv`) with kfolds by running the code block below:

``````
train_data.to_csv('train_kfolds.csv', index=False)
``````

Here’s the Kaggle notebook, which you can copy and edit.

### Building a regression model

After creating the folds, we will download `train_kfolds.csv` from the output data of our Create_Folds notebook.

We’ll then follow the same environment setup steps to create a new notebook called RegressionModel and upload the dataset and the `train_kfolds.csv` file to it.

After we’re done with the environment setup, we’ll proceed to build our model.

#### Importing necessary libraries

To build our regression model, we need to import the following libraries:

``````
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
``````

Once done, we will then proceed to read our data.

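We will read the `train_kfolds.csv` file from the newly uploaded `trainfolds` data using the code block below. The test set and sample submission from `Dataset2` are used later on as `test_data` and `submission`, so they need to be loaded as well.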

``````
data = pd.read_csv('/kaggle/input/trainfolds/train_kfolds.csv')
``````
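The feature selection and submission code further down relies on a `test_data` DataFrame and a `submission` DataFrame. As a sketch of how they might be loaded, the helper below reads both from one dataset folder. The function name `load_test_and_submission` and the file names `test.csv` and `sample_submission.csv` are assumptions, not taken from the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_test_and_submission(input_dir):
    """Read the test features and the sample submission from one folder.

    The file names below are assumptions; adjust them to match your dataset.
    """
    input_dir = Path(input_dir)
    test_data = pd.read_csv(input_dir / "test.csv")
    submission = pd.read_csv(input_dir / "sample_submission.csv")
    return test_data, submission
```

On Kaggle this could be called as `test_data, submission = load_test_and_submission('/kaggle/input/Dataset2')`, assuming that is where the uploaded dataset is mounted.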

#### Feature selection

We will select the useful features from our dataset and leave out the columns that should not be used as model inputs: `id` (a row identifier), `target` (the value we are predicting), and `kfold` (the fold label we added).

To select the useful features, run the following block of code:

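``````
useful_features = [i for i in data.columns if i not in ("id", "target", "kfold")]
# Categorical columns are identified by "cat" in their name
object_cols = [col for col in useful_features if "cat" in col]
# test_data is assumed to have been loaded from the dataset's test CSV beforehand
test_data = test_data[useful_features]
``````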
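To see what these list comprehensions produce, here is the same selection applied to a tiny made-up DataFrame. The column names `cat0` and `cont0` are assumptions chosen to resemble a dataset with categorical and continuous features.

```python
import pandas as pd

# Toy frame with one categorical and one continuous feature
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "cat0": ["A", "B"],
        "cont0": [0.1, 0.2],
        "target": [7.5, 8.1],
        "kfold": [0, 1],
    }
)

useful_features = [c for c in toy.columns if c not in ("id", "target", "kfold")]
object_cols = [c for c in useful_features if "cat" in c]

print(useful_features)  # ['cat0', 'cont0']
print(object_cols)      # ['cat0']
```

Note that `"cat" in col` is a substring match, so it only works when the categorical columns follow such a naming scheme. For other datasets, selecting by dtype with `DataFrame.select_dtypes(include="object")` may be a safer choice.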

#### Modeling

To build our model, we will run the following block of code:

``````
final_predictions = []

for fold in range(5):
    # Rows outside the current fold train the model; rows inside it validate it
    xtrain = data[data.kfold != fold].reset_index(drop=True)
    xvalid = data[data.kfold == fold].reset_index(drop=True)
    xtest = test_data.copy()

    ytrain = xtrain.target
    yvalid = xvalid.target

    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]

    # Data Encoding
    oe = OrdinalEncoder()
    xtrain[object_cols] = oe.fit_transform(xtrain[object_cols])
    xvalid[object_cols] = oe.transform(xvalid[object_cols])
    xtest[object_cols] = oe.transform(xtest[object_cols])

    # Model Training
    model = XGBRegressor(random_state=fold, n_jobs=5)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    preds_test = model.predict(xtest)
    final_predictions.append(preds_test)
    # Validation RMSE for this fold (square root of the MSE)
    print(fold, np.sqrt(mean_squared_error(yvalid, preds_valid)))
``````

For each fold, we encode the categorical columns and then train the model using XGBoost (Extreme Gradient Boosting), an ensemble learning technique that combines many decision trees to boost the performance of our model.

XGBoost is a regularized boosting technique that provides high predictive power and is often faster than other boosting implementations. We then evaluate each fold individually and print its validation RMSE.
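One detail worth knowing about the encoding step: with its default settings, `OrdinalEncoder` raises an error if the validation or test data contains a category that was not present in the data it was fitted on. The sketch below shows one possible way to guard against that, using the `handle_unknown` option available in scikit-learn 0.24 and later. The toy data is made up, and this is not necessarily what the original notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"cat0": ["A", "B", "A", "C"]})
valid = pd.DataFrame({"cat0": ["B", "D"]})  # "D" never appears in train

# Unknown categories are mapped to -1 instead of raising an error
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = oe.fit_transform(train[["cat0"]])
valid_encoded = oe.transform(valid[["cat0"]])

print(train_encoded.ravel())  # A, B, C are encoded as 0, 1, 2
print(valid_encoded.ravel())  # B is 1, the unseen D is -1
```

The encoder is fitted on the training rows only and then reused for the validation and test rows, which mirrors the `fit_transform` and `transform` calls in the loop above.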

#### Model evaluation

After individually evaluating each fold, we will now combine the five fold models by averaging their predictions on our test data.

To do this, use the following code block:

``````
preds = np.mean(np.column_stack(final_predictions), axis=1)
``````
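The averaging step is easier to follow on a small example. In the sketch below, three made-up prediction arrays stand in for the per-fold test predictions: `np.column_stack` places each model's predictions in its own column, and the mean along `axis=1` averages across models for every test row.

```python
import numpy as np

# Pretend predictions from three fold models on three test rows
final_predictions = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 2.0, 5.0]),
    np.array([3.0, 2.0, 4.0]),
]

stacked = np.column_stack(final_predictions)  # shape: (test rows, models)
preds = np.mean(stacked, axis=1)              # one averaged value per test row

print(stacked.shape)  # (3, 3)
print(preds)          # [2. 2. 4.]
```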

To find out how our model performs on the test data, we will write the averaged predictions to a submission file using the following code:

``````
# submission is assumed to be the competition's sample submission, loaded beforehand
submission["target"] = preds
submission.to_csv("submission1.csv", index=False)
``````
``````

To view the contents of our submission file, run the following code:

``````
# The file was saved to the notebook's working directory by the previous block
sub = pd.read_csv('submission1.csv')
sub
``````

Bonus: If you signed up for the 30 Days of ML Kaggle challenge, you can make a late submission and see how your model performs.

### Hyperparameter optimization

In this process, we’ll fine-tune our model’s hyperparameters until we achieve the desired result.

A few common XGBoost parameters with a large effect on model performance include `max_depth`, `learning_rate`, `n_estimators`, `colsample_bytree`, and `subsample`. The `n_jobs` parameter only controls how many threads are used for training, so it affects speed rather than prediction quality.

To fine-tune our model, add the following changes to the XGBoost regressor:

``````
model = XGBRegressor(
    random_state=fold, n_jobs=5, learning_rate=0.1, subsample=0.8,
    max_depth=5, min_child_weight=1, gamma=0, scale_pos_weight=1,
)
``````
``````

Once you run the above code, the model’s RMSE should improve slightly compared with our first example. You can continue changing the parameters until the model meets your goal. For example, you can target an RMSE value like `0.7100` to measure your model’s success.

You can also look at scikit-learn’s `GridSearchCV` or Optuna, which make it easier to fine-tune your model.
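As a rough sketch of what a grid search looks like, the example below tunes `max_depth` and `learning_rate` with `GridSearchCV` on synthetic data. To keep it dependent on scikit-learn alone, it swaps in `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; since `XGBRegressor` follows the scikit-learn estimator interface, it can be passed to `GridSearchCV` in the same way. The grid values and the data are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Every combination in the grid is scored with 3-fold cross-validation
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is a negated RMSE, so flip the sign before reporting it
print(search.best_params_, -search.best_score_)
```

A grid search tries every combination, so its cost grows quickly as parameters are added. Tools like Optuna sample the search space instead, which tends to scale better to larger searches.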

For more detail, read this hyperparameter tuning article, which goes beyond the scope of this tutorial.

Here’s the Kaggle notebook for our regression model.

### Conclusion

Building a performance-driven model is not a very easy task. It involves refining our model again and again until we get the desired outcome.

Either way, mastering the art of modeling can be very rewarding, whether it is in a machine learning or a data science project, or a competition.

Happy coding!

Peer Review Contributions by: Willies Ogola