During the test stage, i.e., once the model is in production, tsfresh feature generation for new data does not depend on the training data. One can therefore apply the same feature engineering process to new data as to the training data, without worrying about storing information from the training stage, as the sketch below shows.
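For example, here is a minimal sketch of this stateless usage, assuming new data arrives as a long-format dataframe `df_ts_new` with `id` and `time` columns (the dataframe name is hypothetical):

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters

# The feature calculators are pure functions of the time series, so the
# exact same settings used during training can be re-applied to new data
# without carrying any state over from the training stage.
settings = ComprehensiveFCParameters()
X_new = extract_features(
    df_ts_new,  # hypothetical new time series data in long format
    column_id="id",
    column_sort="time",
    default_fc_parameters=settings,
)
```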
On the other hand, one can also use the following example to leverage the scikit-learn pipeline style to handle the feature generation for both the training and test stages.
Feature Selection in a sklearn pipeline
```python
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from tsfresh.transformers import RelevantFeatureAugmenter
```
Load and Prepare the Data
Check out the first example notebook to learn more about the data and format.
```python
from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,
    load_robot_execution_failures,
)

# Download the dataset (only needed once) and load it: df_ts holds the
# time series, y the binary target (failure or not).
download_robot_execution_failures()
df_ts, y = load_robot_execution_failures()
```
We want to use the extracted features to predict, for each of the robot executions, whether it was a failure or not.
Therefore our basic "entity" is a single robot execution, given by a distinct `id`.
A dataframe with these identifiers as index needs to be prepared for the pipeline.
```python
# One row per entity (= robot execution); the pipeline will add the
# extracted features as new columns.
X = pd.DataFrame(index=y.index)

X_train, X_test, y_train, y_test = train_test_split(X, y)
```
Build the pipeline
We build a sklearn pipeline that consists of a feature extraction step (`RelevantFeatureAugmenter`) with a subsequent `RandomForestClassifier`.
The `RelevantFeatureAugmenter` takes roughly the same arguments as `extract_features` and `select_features` do.
```python
ppl = Pipeline([
    ("augmenter", RelevantFeatureAugmenter(column_id="id", column_sort="time")),
    ("classifier", RandomForestClassifier()),
])
```
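For instance, extraction and selection arguments can be passed straight to the augmenter. The following sketch assumes the constructor accepts the usual `default_fc_parameters` and `fdr_level` arguments known from `extract_features` and `select_features`; the reduced feature set is just an illustrative choice:

```python
from tsfresh.feature_extraction import MinimalFCParameters

# Same pipeline step, but computing only a minimal feature set and using a
# custom false discovery rate for the selection step.
fast_augmenter = RelevantFeatureAugmenter(
    column_id="id",
    column_sort="time",
    default_fc_parameters=MinimalFCParameters(),
    fdr_level=0.05,
)
```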
Here comes the tricky part!
The input to the pipeline will be our dataframe `X`, which has one row per identifier.
It is currently empty.
But from which time series data should the `RelevantFeatureAugmenter` actually extract the features?
We need to pass the time series data (stored in `df_ts`) to the transformer.
In this case, `df_ts` contains the time series of both the train and test set. If you have different dataframes for the train and test set, you have to call `set_params` twice
(see further below on how to deal with two independent data sets):
```python
ppl.set_params(augmenter__timeseries_container=df_ts)
```
We are now ready to fit the pipeline:
```python
ppl.fit(X_train, y_train)
```
The augmenter has used the input time series data to extract time series features for each of the identifiers in `X_train`, and selected only the relevant ones, using the passed `y_train` as target.
These features have been added to `X_train` as new columns.
The classifier can now use these features during training.
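If you want to look at the augmented design matrix yourself, you can apply the fitted augmenter step on its own (a sketch; this standalone call is not required for the pipeline to work):

```python
# The fitted augmenter returns X_train with the selected time series
# features appended as new columns.
X_train_augmented = ppl.named_steps["augmenter"].transform(X_train)
print(X_train_augmented.shape)
```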
Prediction
During inference, the augmenter only extracts the relevant features it identified during the training phase, and the classifier predicts the target using these features.
```python
y_pred = ppl.predict(X_test)
```
Finally, we inspect the performance:
```python
print(classification_report(y_test, y_pred))
```
You can also find out which columns the augmenter has selected:
```python
ppl.named_steps["augmenter"].feature_selector.relevant_features
```
In this example we passed an empty (except for the index) `X_train` or `X_test` into the pipeline.
However, you can also fill the input with other features you have (e.g. features extracted from the metadata), or even use other pipeline components before it, as sketched below.
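As a minimal sketch (the static column below is purely hypothetical), any columns already present in the input are kept, and the extracted time series features are appended next to them:

```python
X = pd.DataFrame(index=y.index)
# Hypothetical per-entity metadata feature, e.g. an encoded robot type;
# the augmenter leaves existing columns untouched.
X["robot_type"] = 0
```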
Separating the time series data containers
In the example above we passed a single `df_ts` into the `RelevantFeatureAugmenter`, which was used both for training and predicting.
During training, only the data with the `id`s from `X_train` were extracted; during prediction, the rest.
However, it is perfectly fine to call `set_params` twice: once before training and once before prediction.
This can be handy if you, for example, dump the trained pipeline to disk and re-use it later for prediction.
You only need to make sure that the `id`s of the entities you use during training/prediction are actually present in the passed time series data:
```python
import pickle

df_ts_train = df_ts[df_ts["id"].isin(y_train.index)]
df_ts_test = df_ts[df_ts["id"].isin(y_test.index)]

ppl.set_params(augmenter__timeseries_container=df_ts_train)
ppl.fit(X_train, y_train)

# Dump the trained pipeline to disk ("pipeline.pkl" is just an example name).
with open("pipeline.pkl", "wb") as f:
    pickle.dump(ppl, f)
```
Later, load the fitted model and do predictions on new, unseen data:
```python
import pickle

# Re-load the fitted pipeline ...
with open("pipeline.pkl", "rb") as f:
    ppl = pickle.load(f)

# ... point the augmenter at the test time series and predict.
ppl.set_params(augmenter__timeseries_container=df_ts_test)
y_pred = ppl.predict(X_test)

print(classification_report(y_test, y_pred))
```