Handling feature order in training and online production to avoid inconsistency errors


When applying machine learning models in production, whether a LightGBM model or any other, we all know that the order of features should be the same in the training stage, the test stage, and the production stage.

In practice, we might overlook this. In production, new data might arrive in JSON format, where key order is not guaranteed
and has nothing to do with the original feature order used in model training.

The incoming JSON is converted to a DataFrame and passed to the model for prediction. It is easy to overlook
the fact that the column order of the new DataFrame may now differ from that of the original training DataFrame.
It is important to make sure the two are consistent rather than leaving it to chance.
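
To make the problem concrete, here is a minimal sketch of how a JSON payload can produce a DataFrame whose columns sit in different positions than the training data. The payload string and variable names are made up for illustration.

```python
import json

import pandas as pd

# Training frame: position 0 is 'col1' and position 1 is 'col2'.
df_train = pd.DataFrame({"col1": [1.1, 2.1], "col2": [2.2, 3.2]})

# Hypothetical production payload: same fields, but the keys arrive in a different order.
payload = '{"col2": 3.2, "col1": 2.1}'
df_new = pd.DataFrame([json.loads(payload)])

# The column names agree as a set, but not as a sequence.
columns_match = list(df_new.columns) == list(df_train.columns)

# Once the frames are reduced to plain arrays, the same record has a different layout.
# This is what any code that relies on feature position would see.
train_row = df_train.to_numpy()[1].tolist()
new_row = df_new.to_numpy()[0].tolist()

print(columns_match)  # False
print(train_row)      # [2.1, 3.2]
print(new_row)        # [3.2, 2.1]
```

Whether a given model notices the mismatch depends on the library and its version: some validate column names, while others consume values purely by position. Enforcing the order yourself avoids relying on either behavior.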

There are many ways to achieve this; the following shows how to do it in a pipeline fashion.

Define the pipeline now, to handle the issue systematically:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LastStepTransformer(BaseEstimator, TransformerMixin):
    """Reorder incoming columns to match the order seen during fit()."""

    def __init__(self):
        # Column order learned in fit(); empty until then.
        self.traincolumns = []
        print('initialized')

    def fit(self, X, y=None):
        # Remember the column order of the training data.
        self.traincolumns = list(X.columns)
        return self

    def transform(self, X_, y=None):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X_.copy()
        # Make sure any new data follows the same feature order used in the training stage.
        return X[self.traincolumns]


# Build a data processing pipeline, using the transformer above as the last step.
# In practice, any preprocessing steps can be put here as well.
dataPipeline = Pipeline([
    ('last_step', LastStepTransformer())
])






initialized
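
The transformer above assumes that every training column is present in the incoming data. As a hedged variant, the sketch below reports all missing columns in a single error message and drops unexpected extras. The class name StrictColumnOrder, the attribute train_columns_, and the error wording are hypothetical and not part of the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class StrictColumnOrder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: reorder columns and fail loudly if any are missing."""

    def fit(self, X, y=None):
        # Remember the training column order as a plain list.
        self.train_columns_ = list(X.columns)
        return self

    def transform(self, X, y=None):
        missing = [c for c in self.train_columns_ if c not in X.columns]
        if missing:
            raise ValueError(f"missing feature columns: {missing}")
        # Selecting by name reorders the columns and drops unexpected extras.
        return X[self.train_columns_].copy()


df_train = pd.DataFrame({"col1": [1.1, 2.1], "col2": [2.2, 3.2]})
step = StrictColumnOrder().fit(df_train)

# Shuffled order plus an extra column: the result is reordered and the extra is dropped.
df_new = pd.DataFrame({"extra": [0], "col2": [3.1], "col1": [5.2]})
reordered = step.transform(df_new)

# A frame without 'col2' raises here instead of failing later inside the model.
try:
    step.transform(pd.DataFrame({"col1": [1.0]}))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Whether to drop extra columns silently or to reject them as well is a design choice; this sketch drops them because the model never saw them during training.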

Show an example:


data_train = [[1.1, 2.2], [2.1, 3.2]]
data_test = [[3.1, 5.2], [1.1, 2.2]]

# Same column names, but df_test lists them in the opposite order.
df_train = pd.DataFrame(data=data_train, columns=['col1', 'col2'])
df_test = pd.DataFrame(data=data_test, columns=['col2', 'col1'])

display(df_train)
display(df_test)


   col1  col2
0   1.1   2.2
1   2.1   3.2

   col2  col1
0   3.1   5.2
1   1.1   2.2

Now, in the training stage, we call fit_transform() on the data pipeline so that the pipeline remembers the original column order:



dataPipeline.fit_transform(df_train)

   col1  col2
0   1.1   2.2
1   2.1   3.2

Now, in the test stage, we call only transform() on the data pipeline, so any new data is reordered to match the training data:

print('test data, notice the column order')
display(df_test)
print('after transform, notice the column order now changes')
dataPipeline.transform(df_test)

test data, notice the column order

   col2  col1
0   3.1   5.2
1   1.1   2.2
after transform, notice the column order now changes

   col1  col2
0   5.2   3.1
1   2.2   1.1
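
To see the effect end to end, the sketch below places a reordering step in front of a model inside a single pipeline and checks that predictions do not depend on the incoming column order. It restates the idea of LastStepTransformer under the hypothetical name ColumnOrder so that it runs on its own. The LinearRegression model and the made-up target are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnOrder(BaseEstimator, TransformerMixin):
    """Same idea as LastStepTransformer, restated so this sketch runs alone."""

    def fit(self, X, y=None):
        # Remember the training column order.
        self.train_columns_ = list(X.columns)
        return self

    def transform(self, X, y=None):
        # Return the columns in training order.
        return X[self.train_columns_].copy()


df_train = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0], "col2": [2.0, 1.0, 4.0, 3.0]})
y_train = 3.0 * df_train["col1"] - df_train["col2"]  # made-up target for illustration

model = Pipeline([
    ("order", ColumnOrder()),          # enforce the training column order first
    ("regressor", LinearRegression()), # then hand the frame to the model
])
model.fit(df_train, y_train)

# The same record, once in training order and once with the columns swapped.
df_ordered = pd.DataFrame({"col1": [5.0], "col2": [6.0]})
df_shuffled = df_ordered[["col2", "col1"]]

pred_ordered = model.predict(df_ordered)
pred_shuffled = model.predict(df_shuffled)
predictions_match = bool(np.allclose(pred_ordered, pred_shuffled))
```

Because the reordering lives inside the pipeline, saving and loading the fitted pipeline carries the learned column order along with the model, so the serving code does not need to track it separately.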

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner!