If categorical features has too many values, it will generate too many features after encoding, such as one-hot encoding.
We could set the threshold, if certan value has percentage less than the threshold, we change the value to be ‘rare event’ or
something like that. By doing this, we make sure there are not too many levels for a categorical feature.
The following code can be applied on a dataframe:
def cat_rare_event(df,threshold=0.005): |
or we could put this step as a customized pipeline:
from sklearn.base import BaseEstimator, TransformerMixin |