When we deploy a machine learning model to production, whether it is a LightGBM model or any other model, we all know that the order of features should be the same in the training stage, the test stage, and the production stage.
In practice we might overlook that. In production, new data might arrive as JSON, where the key order is not guaranteed and has nothing to do with the feature order used in the model training stage.
The incoming JSON is converted to a dataframe and passed to the model for prediction. It is easy to overlook that the column order of the new dataframe may now differ from the column order of the original training dataframe. It's important to make sure they are consistent, rather than leaving it to chance.
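To make the problem concrete, here is a minimal sketch. It assumes a model trained on two features with the hypothetical names `col1` and `col2`, and a made-up JSON payload whose keys arrive in the opposite order. It only illustrates how the mismatch appears and is not taken from any real service.

```python
import json

import pandas as pd

# Column order used at training time (hypothetical feature names).
train_columns = ["col1", "col2"]

# A made-up production payload whose keys arrive in a different order.
payload = '{"col2": 3.1, "col1": 5.2}'

# pandas keeps the key order of the parsed JSON object as the column order.
df_new = pd.DataFrame([json.loads(payload)])
print(list(df_new.columns))

# Anything that reads the values by position now sees them swapped.
print(df_new.to_numpy()[0])

# Selecting by the training column list restores the expected order.
print(df_new[train_columns].to_numpy()[0])
```

Any step that looks only at positions, for example a model fed with a plain NumPy array, would take the value of `col2` for `col1` here without raising an error.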
There are many ways to keep the column order consistent; the following shows how to do it in a pipeline fashion.
Define the pipeline now, to handle the column order systematically
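import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


# NOTE: the class line is missing from the source; the name ReorderColumnTransformer is a placeholder
class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # remember the column order seen in the training stage
        # (trailing underscore: scikit-learn's convention for attributes set in fit)
        self.traincolumns_ = X.columns
        return self

    def transform(self, X_, y=None):
        X = X_.copy()
        # make sure any new data follows the same order of features used in the training stage
        return X[self.traincolumns_]


# make a data processing pipeline, using the above transformer as the last step here.
# in practice, any preprocessing steps can be put here as well
# NOTE: the pipeline definition is also missing from the source; the step name is a placeholder
dataPipeline = Pipeline(steps=[('reorder_columns', ReorderColumnTransformer())])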
Now in the training stage, we call fit_transform() of the data pipeline, so the pipeline will remember the original column order
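df_train = pd.DataFrame({'col1': [1.1, 2.1], 'col2': [2.2, 3.2]})  # reconstructed from the output shown below
dataPipeline.fit_transform(df_train)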
   col1  col2
0   1.1   2.2
1   2.1   3.2
Now in the test stage, we only call transform() of the data pipeline, so any new data will be reordered to match the training data
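df_test = pd.DataFrame({'col2': [3.1, 1.1], 'col1': [5.2, 2.2]})  # reconstructed from the output shown below
print('test data, notice the column order')
display(df_test)
print('after transform, notice the column order now changes')
dataPipeline.transform(df_test)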
test data, notice the column order
   col2  col1
0   3.1   5.2
1   1.1   2.2
after transform, notice the column order now changes
   col1  col2
0   5.2   3.1
1   2.2   1.1
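Coming back to the JSON case from the beginning, the same idea can be sketched end to end. This is only one possible variant, not the code from this post: the class name `StrictColumnOrder`, the step name `column_order`, and the payloads are all hypothetical, and the check for missing features is an extra assumption added for illustration.

```python
import json

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class StrictColumnOrder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: reorders columns and rejects incomplete payloads."""

    def fit(self, X, y=None):
        # Store the training column order as a plain list.
        self.train_columns_ = list(X.columns)
        return self

    def transform(self, X, y=None):
        missing = [c for c in self.train_columns_ if c not in X.columns]
        if missing:
            raise ValueError(f"missing features: {missing}")
        # Selecting by the stored list returns a new, reordered dataframe.
        return X[self.train_columns_]


pipeline = Pipeline(steps=[("column_order", StrictColumnOrder())])
pipeline.fit(pd.DataFrame({"col1": [1.1, 2.1], "col2": [2.2, 3.2]}))

# A made-up production payload with keys in a different order.
payload = '{"col2": 3.1, "col1": 5.2}'
df_new = pd.DataFrame([json.loads(payload)])
reordered = pipeline.transform(df_new)
print(list(reordered.columns))

# A payload that lacks a feature fails loudly instead of silently misaligning.
try:
    pipeline.transform(pd.DataFrame([{"col2": 3.1}]))
except ValueError as err:
    print(f"rejected: {err}")
```

Raising on a missing feature is a design choice; depending on the model, filling in a default value might be preferable, but that decision is better made explicitly than left to whatever the payload happens to contain.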
Reprint policy: All articles in this blog are licensed under CC BY 4.0 unless otherwise stated. If reproduced, please indicate the source: robot learner.