Maximizing Model Performance with Feature Construction
Feature construction, a core part of feature engineering, is the process of creating new features (also called variables, attributes, or predictors) from existing data in order to improve the performance of a machine learning algorithm.
In machine learning, features are the input variables used to make predictions or form groups. In feature construction, existing variables are transformed or combined into new features that capture the patterns in the data more directly.
How does Feature Construction work?
Feature construction creates new features out of the features we already have in our dataset. The new features are often more relevant to the prediction task than the original ones, which can help the machine learning model achieve better results.
Sometimes a new feature is created by applying a simple mathematical operation, such as multiplication or division, to two existing features (as demonstrated in the example below). In other situations, constructing informative features requires more specific domain knowledge about the dataset.
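As a minimal sketch of that idea on a tiny, made-up DataFrame (the column names total_rooms and households are purely illustrative and not part of the dataset used below):
import pandas as pd
# Toy data with illustrative column names
houses = pd.DataFrame({'total_rooms': [1200, 850, 2300], 'households': [300, 170, 460]})
# Constructed feature: a simple division between two existing features
houses['rooms_per_household'] = houses['total_rooms'] / houses['households']
print(houses)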
Feature construction is part of the larger process of “feature engineering,” which transforms the raw input features into a new set of features better suited to the learning algorithm. Besides construction, this process includes operations such as feature selection, discretization, and normalization.
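For context, here is a minimal sketch of two of those other operations (discretization and normalization) on toy data, using scikit-learn transformers built for this purpose; it is an illustration only and is not used in the example below:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
X_toy = np.array([[1.0], [2.5], [7.0], [10.0]])  # a single toy feature
# Discretization: bin a continuous feature into ordinal categories
disc = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
print(disc.fit_transform(X_toy).ravel())
# Normalization (standardization): rescale to zero mean and unit variance
scaler = StandardScaler()
print(scaler.fit_transform(X_toy).ravel())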
Feature Construction Example
We will use the California housing dataset from Scikit-Learn to demonstrate feature construction. The task is to predict the median house value in a given California district from attributes such as the median income or the average number of rooms per household.
First, we import the necessary libraries:
from sklearn.datasets import fetch_california_housing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from xgboost.sklearn import XGBRegressor
- sklearn.datasets: a module in the popular Python machine learning library scikit-learn. It provides a collection of datasets that have already been cleaned up to some extent and formatted so that machine learning algorithms can use them easily.
- NumPy: the fundamental package for scientific computing in Python. It provides functions for linear algebra, Fourier transforms, and matrix operations, and is used heavily in machine learning. Python lists can serve as arrays, but they are slow to process; NumPy provides an array object that can be up to 50x faster than traditional Python lists.
- Pandas: an open-source library built on top of Python for data manipulation and analysis. It offers powerful, flexible, and easy-to-use data structures and operations, giving Python the ability to work with spreadsheet-like data: fast loading, aligning, manipulating, and merging, among other key functions.
Next, we load the dataset and split it into features and target variables:
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target
feature_names = dataset.feature_names
To get a better view of the dataset, let’s combine the features and labels into one DataFrame:
complete_data = np.column_stack((X, y))
df = pd.DataFrame(complete_data, columns=np.append(feature_names, 'MedianValue'))
df.head()
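If you want a quick sanity check of the DataFrame before modeling, you can also print its shape and column names; the eight California-housing features plus the appended target should appear:
print(df.shape)
print(list(df.columns))
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
#  'AveOccup', 'Latitude', 'Longitude', 'MedianValue']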
Before we add anything new, let’s see how well simple Linear Regression and Extreme Gradient Boosting (XGBoost) models perform on this dataset. First, we’ll split the data into an 80% training set and a 20% test set:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23)
Then, we use the training set to fit the Regression models, which we’ll test on both the training set and the test set:
# Linear regression baseline
reg = LinearRegression()
reg.fit(X_train, y_train)
train_score = reg.score(X_train, y_train)
print('R2 score on the training set:', np.round(train_score, 5))
test_score = reg.score(X_test, y_test)
print('R2 score on the test set:', np.round(test_score, 5))

# XGBoost baseline
reg = XGBRegressor()
reg.fit(X_train, y_train)
train_score = reg.score(X_train, y_train)
print('R2 score on the training set:', np.round(train_score, 5))
test_score = reg.score(X_test, y_test)
print('R2 score on the test set:', np.round(test_score, 5))
The results we get are:
R2 score on the training set (Linear Regression): 0.6089
R2 score on the test set (Linear Regression): 0.59432
R2 score on the training set (XGB Regression): 0.93963
R2 score on the test set (XGB Regression): 0.83604
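For reference, the score() method of both regressors returns the coefficient of determination (R²), so the numbers above are equivalent to computing r2_score on the model's predictions; a short sketch for the last fitted model (the XGBRegressor):
from sklearn.metrics import r2_score
# Equivalent to reg.score(X_test, y_test) for the fitted regressor
y_pred = reg.predict(X_test)
print(np.round(r2_score(y_test, y_pred), 5))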
Constructing a New Feature
Now, let’s look at the list of features and see whether we can come up with a new feature that is more relevant to our target variable (the median house value). Take the average number of rooms as an example. On its own, this feature may not be a good indicator of the house value, for the following reasons:
- Some districts may have larger families with lower incomes, in which case house prices will tend to be lower despite the larger number of rooms.
- Some districts may have smaller families with higher incomes, in which case house prices will tend to be higher.
The same reasoning applies to the average number of bedrooms. Instead of using each of these two features on its own, we can use the ratio between them: houses with more rooms per bedroom suggest a more luxurious standard of living, which could mean a higher median house value.
df['RoomsPerBedroom'] = df['AveRooms'] / df['AveBedrms']
Now, let’s look at how strongly our features correlate with the target label (MedianValue). To do this, we will use the DataFrame’s corrwith() method, which calculates the Pearson correlation coefficient between each column and the target column:
df.corrwith(df['MedianValue']).sort_values(ascending=False)
MedianValue 1.000000
MedInc 0.688075
RoomsPerBedroom 0.383672
AveRooms 0.151948
HouseAge 0.105623
AveOccup -0.023737
Population -0.024650
Longitude -0.045967
AveBedrms -0.046701
Latitude -0.144160
dtype: float64
The correlations show that our new feature (RoomsPerBedroom) is more strongly correlated with the target than any of the original features except MedInc, and far more than AveRooms or AveBedrms on their own.
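Pearson correlation only captures linear relationships, so as an optional complementary check you could also rank features by their mutual information with the target; a minimal sketch using scikit-learn's mutual_info_regression:
from sklearn.feature_selection import mutual_info_regression
features = df.drop(columns=['MedianValue'])
mi = mutual_info_regression(features, df['MedianValue'], random_state=0)
print(pd.Series(mi, index=features.columns).sort_values(ascending=False))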
How Well the Models Work with the New Feature
Let’s see what happens to our regression models when we add the new feature. First, we need to get the features (X) and labels (y) from the new DataFrame:
X = df.drop(['MedianValue'], axis=1)
y = df['MedianValue']
Since we have added a new column to the dataset, we need to split it into a training set and a test set again (note that a different random_state is used here, so the split is not identical to the one used for the baseline):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Finally, we re-fit both models and evaluate them:
# Linear regression with the new feature
reg = LinearRegression()
reg.fit(X_train, y_train)
train_score = reg.score(X_train, y_train)
print('R2 score on the training set:', np.round(train_score, 5))
test_score = reg.score(X_test, y_test)
print('R2 score on the test set:', np.round(test_score, 5))

# XGBoost with the new feature
reg = XGBRegressor()
reg.fit(X_train, y_train)
train_score = reg.score(X_train, y_train)
print('R2 score on the training set:', np.round(train_score, 5))
test_score = reg.score(X_test, y_test)
print('R2 score on the test set:', np.round(test_score, 5))
The results we get are:
R2 score on the training set (Linear Regression): 0.61645
R2 score on the test set (Linear Regression): 0.60117
R2 score on the training set (XGB Regression): 0.94348
R2 score on the test set (XGB Regression): 0.83903
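As a quick sanity check that the boosted model actually makes use of the constructed feature, you can inspect XGBoost's feature importances (a minimal sketch, assuming reg is the XGBRegressor fitted above):
importances = pd.Series(reg.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))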
Through a simple feature-construction tweak, we have increased the R2 score on both the training and the test set. The improvement here is modest, but there are many cases where you’ll find that feature construction is a game changer!
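If you want to take this further, pairwise interaction features can also be generated automatically rather than by hand; a sketch using scikit-learn's PolynomialFeatures (for illustration only, not used in the results above):
from sklearn.preprocessing import PolynomialFeatures
# interaction_only=True adds only products of feature pairs;
# include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
print(X_interactions.shape)  # original columns plus one column per feature pair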