Aggregate features from different rows into one row in a pandas DataFrame


In many use cases, the features of a single event are stored across multiple rows of a table.
Each row holds one feature, and the columns describe it: its name, value, timestamp, etc.

For machine learning, we usually need to aggregate these rows into one row per event before training. The following shows how to do that easily with a pandas DataFrame.

Generate an example dataframe:

import pandas as pd

data = [ ['1','name1','value1'],['1','name2','value2'],['1','name3','value3'],
['2','name1','value4'],['2','name2','value5']
]

df = pd.DataFrame(data=data)
df.columns =['id','name','value']
display(df)


  id   name   value
0  1  name1  value1
1  1  name2  value2
2  1  name3  value3
3  2  name1  value4
4  2  name2  value5

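Group the dataframe by id and collect each group's name/value pairs into a dict, which gives one column of dicts. Expanding those dicts into a new dataframe then yields one row per id and one column per feature name: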

df2 = df.groupby('id').apply(lambda x: dict(x[['name','value']].values.tolist())).reset_index()
df3 = pd.DataFrame(data=df2[0].values.tolist())
display(df3)

    name1   name2   name3
0  value1  value2  value3
1  value4  value5     NaN
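
Note that the result above no longer carries the id of each row. As a possible alternative (not the method used in this post), the same reshaping can be sketched with pandas' built-in `DataFrame.pivot`, which keeps the id as the index. This assumes each (id, name) pair appears at most once; `pivot` raises an error on duplicates.

```python
import pandas as pd

# Same example data as above
data = [
    ['1', 'name1', 'value1'], ['1', 'name2', 'value2'], ['1', 'name3', 'value3'],
    ['2', 'name1', 'value4'], ['2', 'name2', 'value5'],
]
df = pd.DataFrame(data, columns=['id', 'name', 'value'])

# One row per id, one column per feature name; missing features become NaN
wide = df.pivot(index='id', columns='name', values='value')
wide.columns.name = None  # drop the leftover "name" label on the column axis
print(wide)
```

If the id is needed as a regular column rather than the index, `wide.reset_index()` brings it back.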

Put the above transformation into a custom scikit-learn transformer:


from sklearn.base import BaseEstimator, TransformerMixin

class my_Transformer(BaseEstimator, TransformerMixin):

    # Class constructor
    def __init__(self):
        print('start')

    # Nothing is learned from the data, so fit just returns self
    def fit(self, X, y=None):
        return self

    # Customized transformer: collapse the rows of each id into one row
    def transform(self, X_, y=None):
        # Work on a copy so the caller's dataframe is left untouched
        X = X_.copy()

        X2 = X.groupby('id').apply(lambda x: dict(x[['name','value']].values.tolist())).reset_index()
        X3 = pd.DataFrame(data=X2[0].values.tolist())

        return X3



# get a transformer object
my_transformer = my_Transformer()

# apply the transform on the original data

df_new = my_transformer.transform(df)
display(df_new)
start

    name1   name2   name3
0  value1  value2  value3
1  value4  value5     NaN
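
The transformer above hard-codes the column names and returns whatever feature columns happen to be in the data it is given. A minimal sketch of a variant follows; the class name `LongToWide` and its parameters `id_col`, `name_col` and `value_col` are hypothetical, introduced only for illustration. It takes the column names as parameters and remembers the feature names seen during `fit`, so that `transform` always returns the same columns. It uses `DataFrame.pivot` instead of the groupby-and-dict approach and assumes each (id, name) pair appears at most once.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LongToWide(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the transformer above with configurable columns."""

    def __init__(self, id_col='id', name_col='name', value_col='value'):
        # scikit-learn convention: the constructor only stores its parameters
        self.id_col = id_col
        self.name_col = name_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Remember the feature names seen at fit time
        self.feature_names_ = sorted(X[self.name_col].unique())
        return self

    def transform(self, X, y=None):
        wide = X.pivot(index=self.id_col, columns=self.name_col, values=self.value_col)
        wide.columns.name = None
        # Features missing from X become NaN columns; names unseen in fit are dropped
        return wide.reindex(columns=self.feature_names_)

# Same example data as above
data = [
    ['1', 'name1', 'value1'], ['1', 'name2', 'value2'], ['1', 'name3', 'value3'],
    ['2', 'name1', 'value4'], ['2', 'name2', 'value5'],
]
df = pd.DataFrame(data, columns=['id', 'name', 'value'])

wide = LongToWide().fit_transform(df)
print(wide)
```

Fixing the output columns at fit time matters when the transformer feeds a model: new data that lacks a feature still produces a frame with the same columns as the training data.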

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.