When working with tabular data, you often need to transform one or more columns to make them more amenable to analysis or modeling. In many cases, these transformations can be easily accomplished using the pandas library. However, when working with large datasets or building machine learning pipelines, it is often more convenient to use scikit-learn’s ColumnTransformer class, which applies transformations to specific columns of the data and can be fitted once and reused as a pipeline step.
In this blog post, we’ll demonstrate how to use a custom transformer with scikit-learn’s ColumnTransformer to transform one or more columns of a NumPy array or a pandas DataFrame.
Example 1: Transforming NumPy arrays
Let’s start with a simple example where we have a NumPy array with three columns, and we want to transform the first two columns into two new columns.
import numpy as np
In this example, the CustomTransformer class takes two input columns and transforms them into two output columns. The ColumnTransformer applies this transformer to columns 0 and 1 of the input data, and the “passthrough” option preserves column 2 in its original form. In the output, the transformed columns come first, followed by the passed-through column.
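The setup described above can be sketched as follows. This is a minimal illustration, not necessarily the original listing: the choice of transformation (squaring the first column and taking the square root of the second) is an assumption borrowed from Example 2, and the sample values in X are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class CustomTransformer(BaseEstimator, TransformerMixin):
    """Assumed behavior: square the first selected column, take the
    square root of the second."""

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Return a 2D array with one output column per input column.
        return np.column_stack([X[:, 0] ** 2, np.sqrt(X[:, 1])])


# Illustrative data: three rows, three columns.
X = np.array([
    [1.0, 4.0, 7.0],
    [2.0, 9.0, 8.0],
    [3.0, 16.0, 9.0],
])

ct = ColumnTransformer(
    transformers=[("custom", CustomTransformer(), [0, 1])],
    remainder="passthrough",  # keep column 2 unchanged
)

X_out = ct.fit_transform(X)
print(X_out)
```

Inheriting from BaseEstimator and TransformerMixin lets the ColumnTransformer clone the transformer and gives it a fit_transform method for free.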
Example 2: Transforming Pandas DataFrames
Now, let’s modify the previous example to work with a Pandas DataFrame instead of a NumPy array.
import pandas as pd
In this example, the CustomTransformer class takes two input columns (‘A’ and ‘B’) and returns a pandas DataFrame with two output columns (‘A_squared’ and ‘B_sqrt’). The ColumnTransformer applies this transformer to columns ‘A’ and ‘B’ of the input data, and the “passthrough” option preserves column ‘C’ in its original form. Note that, by default, the ColumnTransformer itself returns a NumPy array, so the column names have to be reattached if a DataFrame is needed downstream.
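A minimal sketch of the DataFrame version might look like this. The column names ‘A’, ‘B’, ‘C’, ‘A_squared’ and ‘B_sqrt’ come from the description above; the sample values and the final step that rebuilds a DataFrame by hand are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class CustomTransformer(BaseEstimator, TransformerMixin):
    """Square column 'A' and take the square root of column 'B'."""

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # With DataFrame input and string column selectors, the
        # ColumnTransformer passes the selected columns as a DataFrame.
        return pd.DataFrame(
            {"A_squared": X["A"] ** 2, "B_sqrt": np.sqrt(X["B"])},
            index=X.index,
        )


# Illustrative data (made up for this sketch).
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, 9.0, 16.0],
    "C": [7.0, 8.0, 9.0],
})

ct = ColumnTransformer(
    transformers=[("custom", CustomTransformer(), ["A", "B"])],
    remainder="passthrough",  # keep column 'C' unchanged
)

# With default settings the result is a NumPy array, so the column
# names are reattached manually here.
out = ct.fit_transform(df)
result = pd.DataFrame(out, columns=["A_squared", "B_sqrt", "C"], index=df.index)
print(result)
```

Rebuilding the DataFrame by hand keeps the sketch independent of the scikit-learn version; recent releases also offer set_output(transform="pandas") on the ColumnTransformer as an alternative.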