Handling feature order in the training and online production stages to avoid inconsistency errors

data science

Publish Date: 2021-06-18

When applying a machine learning model in production, whether a LightGBM model or any other,
we all know that the order of features should be the same in the training stage, the test stage, and the production stage.

In practice we might overlook that. In the production stage, new data might arrive in JSON format, where the key order is not guaranteed
and has nothing to do with the original feature order used in the model training stage.

The incoming JSON is converted to a dataframe and passed to the model for prediction. It is easy to
overlook the fact that the new dataframe's column order may now differ from the column order of the original training dataframe.
It’s important to make sure they are consistent rather than leaving it to chance.
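To make the problem concrete, here is a minimal sketch of how a JSON record ends up as a dataframe with a different column order. The payload string is a made-up example, and the column names simply mirror the ones used later in this post.

```python
import json

import pandas as pd

# Training frame with the column order the model was fitted on.
df_train = pd.DataFrame([[1.1, 2.2], [2.1, 3.2]], columns=["col1", "col2"])

# Hypothetical production payload: same features, keys in a different order.
payload = '{"col2": 3.1, "col1": 5.2}'
record = json.loads(payload)

# The dataframe columns follow the key order of the incoming record.
df_new = pd.DataFrame([record])

print(list(df_train.columns))  # ['col1', 'col2']
print(list(df_new.columns))    # ['col2', 'col1']

# A positional conversion keeps the payload order,
# so the value of col2 lands in the slot where the model expects col1.
print(df_new.to_numpy())       # [[3.1 5.2]]
```

Any model that consumes the values by position would read 3.1 as col1 here, without raising an error.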

There are many ways to achieve this; the following shows how to do it in a pipeline fashion.
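For comparison, one of the simpler non-pipeline options is to remember the training column list and select the new data's columns by name in that order. This is only a sketch; `train_columns` and `df_aligned` are illustrative names, not something from a library.

```python
import pandas as pd

df_train = pd.DataFrame([[1.1, 2.2], [2.1, 3.2]], columns=["col1", "col2"])

# Remember the training column order (it could be saved next to the model).
train_columns = list(df_train.columns)

# New data arriving with the columns in a different order.
df_new = pd.DataFrame([[3.1, 5.2]], columns=["col2", "col1"])

# Select the columns by name, in the training order.
df_aligned = df_new[train_columns]

print(list(df_aligned.columns))  # ['col1', 'col2']
print(df_aligned.to_numpy())     # [[5.2 3.1]]
```

This works, but the column list has to be carried around by hand, which is the bookkeeping a pipeline step can take over.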

Define the pipeline now, to handle the issue systematically

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class LastStepTransformer(BaseEstimator, TransformerMixin):
    """Remember the training column order and re-apply it to new data."""

    # Class constructor
    def __init__(self):
        self.traincolumns = []
        print('initialized')

    # Record the column order seen during training
    def fit(self, X, y=None):
        self.traincolumns = list(X.columns)
        return self

    # Return a copy of the new data with columns in the training order
    def transform(self, X_, y=None):
        X = X_.copy()
        # make sure any new data follows the same order of features used in the training stage
        return X[self.traincolumns]

# build a data-processing pipeline, using the above transformer as the last step.
# in practice, any preprocessing steps can be put here as well

dataPipeline = Pipeline([
    ('last_step', LastStepTransformer())
])

initialized

Show an example


data_train =[[1.1,2.2],[2.1,3.2]]
data_test =[[3.1,5.2],[1.1,2.2]]

df_train = pd.DataFrame(data=data_train,columns=['col1','col2'])
df_test = pd.DataFrame(data=data_test,columns=['col2','col1'])

display(df_train)
display(df_test)

	col1	col2
0	1.1	2.2
1	2.1	3.2

	col2	col1
0	3.1	5.2
1	1.1	2.2

Now, in the training stage, we call fit_transform() of the data pipeline, so the pipeline remembers the original column order.



dataPipeline.fit_transform(df_train)

	col1	col2
0	1.1	2.2
1	2.1	3.2

Now, in the test stage, we only call transform() of the data pipeline, so any new data is reordered to match the training data.

print('test data, notice the column order')
display(df_test)
print('after transform, notice the column order now changes')
dataPipeline.transform(df_test)

test data, notice the column order

	col2	col1
0	3.1	5.2
1	1.1	2.2

after transform, notice the column order now changes

	col1	col2
0	5.2	3.1
1	2.2	1.1
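To tie this back to prediction, the sketch below puts a reordering step in front of a model inside one pipeline. `ColumnOrderTransformer` is a hypothetical, trimmed-down stand-in for the LastStepTransformer above, `LinearRegression` is used only to keep the example small (it is not the LightGBM model mentioned earlier), and the training data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnOrderTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for LastStepTransformer: reorders columns by name."""

    def fit(self, X, y=None):
        # Remember the column order seen during training.
        self.train_columns_ = list(X.columns)
        return self

    def transform(self, X, y=None):
        # Select the columns by name, in the training order.
        return X[self.train_columns_]


# Made-up training data with an exact linear target.
df_train = pd.DataFrame(
    [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 50.0]],
    columns=["col1", "col2"],
)
y_train = 2.0 * df_train["col1"] - 0.5 * df_train["col2"]

model_pipeline = Pipeline([
    ("reorder", ColumnOrderTransformer()),
    ("model", LinearRegression()),
])
model_pipeline.fit(df_train, y_train)

# New data with the columns swapped, as it might arrive in production.
df_new = pd.DataFrame([[40.0, 5.0]], columns=["col2", "col1"])

pred_shuffled = model_pipeline.predict(df_new)
pred_ordered = model_pipeline.predict(df_new[["col1", "col2"]])

# Both calls give the same prediction, whatever the incoming column order.
print(np.allclose(pred_shuffled, pred_ordered))  # True
```

Because the reordering is part of the fitted pipeline, saving and loading the pipeline carries the training column order along with the model.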

robot learner

https://datasciencebyexample.github.io/2021/06/18/2021-06-18-1/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.

preprocessing pandas tip machine learning
