For data science projects, one important step in feature engineering is to make sure the order of feature columns is the same at training time and at prediction/test time. Otherwise, since many models read features by position rather than by name, we will not get the results we expect.
This is usually not a problem during the train/test split or cross-validation stages, where training and test data are generally split from the same dataframe. However, once the model is put online, the transformer needs to handle single events, which usually arrive as JSON data and are then converted to a dataframe. During this process, the original column order may not be preserved.
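To make the problem concrete, here is a minimal sketch of how a single JSON event can end up with a different column order than the training data. The event payload and the `train_columns` list are made up for illustration; they are assumptions, not data from a real service.

```python
import json

import pandas as pd

# Column order the (hypothetical) model was trained on.
train_columns = ["cat1", "cat2", "cat3"]

# A hypothetical online event: the keys arrive in a different order
# and the third feature is absent.
event_json = '{"cat2": "h", "cat1": "j"}'

# Parse the event and wrap it in a one-row dataframe.
event = json.loads(event_json)
df_event = pd.DataFrame([event])

# The dataframe columns follow the key order of the event,
# not the order used during training.
event_columns = list(df_event.columns)
order_matches = event_columns == train_columns

print(event_columns)
print(order_matches)
```

If `df_event` were converted to a plain array and passed to a model at this point, the values would sit in the wrong positions.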
To ensure the same feature order is used, we can build a transformer for the pipeline. During the fit stage, the original order is remembered, and during the transform stage, the same order is enforced. If any column is missing, we add it as a column of null values.
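The remember-then-enforce idea can be sketched in plain pandas first. This is only a rough illustration and not the transformer class built in this post: `remember_columns` and `enforce_columns` are hypothetical helper names, and the sketch relies on `DataFrame.reindex`, which fills every missing column with NaN and silently drops columns that were not seen during training.

```python
import pandas as pd


def remember_columns(df):
    """Fit step (hypothetical helper): record the training column order."""
    return list(df.columns)


def enforce_columns(df, columns):
    """Transform step (hypothetical helper): reorder columns to match training.

    reindex adds any missing column filled with NaN and drops extra columns.
    """
    return df.reindex(columns=columns)


df_train = pd.DataFrame(
    [["a", "b", "f"], ["c", "d", "e"]], columns=["cat1", "cat2", "cat3"]
)
# Test data with a different order and the third feature missing.
df_test = pd.DataFrame([["h", "j"]], columns=["cat2", "cat1"])

train_columns = remember_columns(df_train)
df_fixed = enforce_columns(df_test, train_columns)
print(df_fixed)
```

Unlike this sketch, a dedicated transformer can choose a different fill value per dtype, for example `False` for boolean columns.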
set up some example dataframes
import pandas as pd
# training example, where we have 3 features
df_train = pd.DataFrame(data=[['a','b','f'],['c','d','e']])
df_train.columns = ['cat1','cat2','cat3']
display(df_train)
# test example, where the 3rd feature is missing and the column order differs
df_test = pd.DataFrame(data=[['h','j']])
df_test.columns = ['cat2','cat1']
display(df_test)
  cat1 cat2 cat3
0    a    b    f
1    c    d    e
  cat2 cat1
0    h    j
a transformer which can be added to a full pipeline
from sklearn.base import BaseEstimator, TransformerMixin
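# NOTE: the class header, __init__ and fit are missing from the text of this post.
# They are reconstructed here from how the rest of the code uses them (the
# "initialized" output and self.dtype_dict holding dtype names as strings),
# so treat them as an assumption.
class orderMaitain_Transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('initialized')

    def fit(self, X, y=None):
        # remember each training column and its dtype name, in the original order
        self.dtype_dict = {col: str(dtype) for col, dtype in X.dtypes.items()}
        return self

    def transform(self, X_, y=None):
        X = X_.copy()
        #print(self.dtype_dict)
        train_columns = []
        # add missing column if any
        for col in self.dtype_dict:
            train_columns.append(col)
            if col not in X.columns:
                # null booleans are treated as False; other strategies can be used as well
                if self.dtype_dict[col].startswith('bool'):
                    X[col] = False
                else:
                    X[col] = pd.Series(dtype=self.dtype_dict[col])
        # apply same order to both training and test
        print(train_columns)
        X = X[train_columns]
        return X

orderMaitain_transformer = orderMaitain_Transformer()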
initialized
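As a rough illustration of how an order-keeping transformer might slot into a larger pipeline, the sketch below chains one with an imputer and a one-hot encoder. Everything here is an assumption made for the example: `ColumnOrderKeeper` is a simplified, hypothetical stand-in (it uses `DataFrame.reindex` rather than the dtype-aware logic in this post), and the choice of `SimpleImputer` and `OneHotEncoder` as downstream steps is arbitrary.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnOrderKeeper(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an order-keeping transformer (illustration only)."""

    def fit(self, X, y=None):
        # Remember the training column order.
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        # Reorder to the training order; missing columns are added as NaN.
        return X.reindex(columns=self.columns_)


pipe = Pipeline([
    ("keep_order", ColumnOrderKeeper()),
    # Replace the NaN values of any added column with a placeholder category.
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    # Categories not seen during training are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

df_train = pd.DataFrame(
    [["a", "b", "f"], ["c", "d", "e"]], columns=["cat1", "cat2", "cat3"]
)
df_test = pd.DataFrame([["h", "j"]], columns=["cat2", "cat1"])

train_encoded = pipe.fit_transform(df_train)
test_encoded = pipe.transform(df_test)
print(train_encoded.shape, test_encoded.shape)
```

Placing the order-keeping step first means every later step sees the same columns, in the same order, during both fitting and prediction.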
apply the transformer during training and test stages
during training stage
orderMaitain_transformer.fit_transform(df_train)
['cat1', 'cat2', 'cat3']
  cat1 cat2 cat3
0    a    b    f
1    c    d    e
during test and prediction stage
# check that the result has an empty column added and that the order is the same as in training
orderMaitain_transformer.transform(df_test)
Reprint policy: unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please indicate the source: robot learner.