A great columnTransformer example to preprocess data for online article SOV prediction

data science

Publish Date: 2022-07-16

ColumnTransformer is one of the most useful data transforming tools for data science projects. Here we show a complete example of applying columnTransformer as the preprocessing tools, and then apply RandomForrest Regression models to predict share of voice (SOV) for online articles.

data introduction

The original link of data is here: https://archive-beta.ics.uci.edu/ml/datasets/online+news+popularity
Features are already created, but checking the feature names alone would help us greatly to undrstand what would influence the online article ranking.

What do the instances that comprise the dataset represent?

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

Attribute Information:
0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
14. data_channel_is_entertainment: Is data channel ‘Entertainment’?
15. data_channel_is_bus: Is data channel ‘Business’?
16. data_channel_is_socmed: Is data channel ‘Social Media’?
17. data_channel_is_tech: Is data channel ‘Tech’?
18. data_channel_is_world: Is data channel ‘World’?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)

Additional Information

The articles were published by Mashable (www.mashable.com) and their content as the rights to reproduce it belongs to them. Hence, this dataset does not share the original content but some statistics associated with it. The original content be publicly accessed and retrieved using the provided urls.
Acquisition date: January 8, 2015
The estimated relative performance values were estimated by the authors using a Random Forest classifier and a rolling windows as assessment method. See their article for more details on how the relative performance values were set.

load libraires

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from matplotlib import pyplot as plt

import data

df = pd.read_csv('data/OnlineNewsPopularity.csv')
df

	url	timedelta	n_tokens_title	n_tokens_content	n_unique_tokens	n_non_stop_words	n_non_stop_unique_tokens	num_hrefs	num_self_hrefs	num_imgs	...	min_positive_polarity	max_positive_polarity	avg_negative_polarity	min_negative_polarity	max_negative_polarity	title_subjectivity	title_sentiment_polarity	abs_title_subjectivity	abs_title_sentiment_polarity	shares
0	http://mashable.com/2013/01/07/amazon-instant-...	731.0	12.0	219.0	0.663594	1.0	0.815385	4.0	2.0	1.0	...	0.100000	0.70	-0.350000	-0.600	-0.200000	0.500000	-0.187500	0.000000	0.187500	593
1	http://mashable.com/2013/01/07/ap-samsung-spon...	731.0	9.0	255.0	0.604743	1.0	0.791946	3.0	1.0	1.0	...	0.033333	0.70	-0.118750	-0.125	-0.100000	0.000000	0.000000	0.500000	0.000000	711
2	http://mashable.com/2013/01/07/apple-40-billio...	731.0	9.0	211.0	0.575130	1.0	0.663866	3.0	1.0	1.0	...	0.100000	1.00	-0.466667	-0.800	-0.133333	0.000000	0.000000	0.500000	0.000000	1500
3	http://mashable.com/2013/01/07/astronaut-notre...	731.0	9.0	531.0	0.503788	1.0	0.665635	9.0	0.0	1.0	...	0.136364	0.80	-0.369697	-0.600	-0.166667	0.000000	0.000000	0.500000	0.000000	1200
4	http://mashable.com/2013/01/07/att-u-verse-apps/	731.0	13.0	1072.0	0.415646	1.0	0.540890	19.0	19.0	20.0	...	0.033333	1.00	-0.220192	-0.500	-0.050000	0.454545	0.136364	0.045455	0.136364	505
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
39639	http://mashable.com/2014/12/27/samsung-app-aut...	8.0	11.0	346.0	0.529052	1.0	0.684783	9.0	7.0	1.0	...	0.100000	0.75	-0.260000	-0.500	-0.125000	0.100000	0.000000	0.400000	0.000000	1800
39640	http://mashable.com/2014/12/27/seth-rogen-jame...	8.0	12.0	328.0	0.696296	1.0	0.885057	9.0	7.0	3.0	...	0.136364	0.70	-0.211111	-0.400	-0.100000	0.300000	1.000000	0.200000	1.000000	1900
39641	http://mashable.com/2014/12/27/son-pays-off-mo...	8.0	10.0	442.0	0.516355	1.0	0.644128	24.0	1.0	12.0	...	0.136364	0.50	-0.356439	-0.800	-0.166667	0.454545	0.136364	0.045455	0.136364	1900
39642	http://mashable.com/2014/12/27/ukraine-blasts/	8.0	6.0	682.0	0.539493	1.0	0.692661	10.0	1.0	1.0	...	0.062500	0.50	-0.205246	-0.500	-0.012500	0.000000	0.000000	0.500000	0.000000	1100
39643	http://mashable.com/2014/12/27/youtube-channel...	8.0	10.0	157.0	0.701987	1.0	0.846154	1.0	1.0	0.0	...	0.100000	0.50	-0.200000	-0.200	-0.200000	0.333333	0.250000	0.166667	0.250000	1300

39644 rows × 61 columns

define features and targets

# be careful the features has a prefix of white space
numeric_features = [' n_tokens_title', ' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos', ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle', ' data_channel_is_entertainment', ' data_channel_is_bus', ' data_channel_is_socmed', ' data_channel_is_tech', ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min', ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg', ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares', ' self_reference_max_shares', ' self_reference_avg_sharess', ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday', ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday', ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02', ' LDA_03', ' LDA_04', ' global_subjectivity', ' global_sentiment_polarity', ' global_rate_positive_words', ' global_rate_negative_words', ' rate_positive_words', ' rate_negative_words', ' avg_positive_polarity', ' min_positive_polarity', ' max_positive_polarity', ' avg_negative_polarity', ' min_negative_polarity', ' max_negative_polarity', ' title_subjectivity', ' title_sentiment_polarity', ' abs_title_subjectivity', ' abs_title_sentiment_polarity']
categorical_features =[]


target = ' shares'

X = df.drop(target,axis=1)
y = df[target]

training and test data split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

define transformers

# steps to handle numeric features
numeric_transformer = Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

# steps to handel categorical features
categorical_transformer = Pipeline(steps = [        
      ("cat_imputer", SimpleImputer(strategy='constant', fill_value = "missing")),
      ("encoder", OneHotEncoder(handle_unknown='ignore'))
    ])

# TFIDF vectorizer to handel the url only
tfidfvectorizer = TfidfVectorizer(max_features = 100, stop_words='english')

# a customize transformer to be used in the  ColumnTransformer
# which takes a list of one or string features, and return the word counts for the list of features

class Counter(BaseEstimator, TransformerMixin):
   
    def fit(self, x, y=None):
        return self
    def transform(self, _X):
        X = pd.DataFrame(_X)
        for col in X.columns:
            X[col]= X[col].apply(lambda x: len(x.split('-')))
        return X
    
counter = Counter()


# integrate a preprocessor to handel different types of features
# notice in the transformer list of the ColumnTransformer,the parameter is 'url', a string type
# while parameters for other transformers are list type
preprocessor = ColumnTransformer(
    transformers=[
        ('url', tfidfvectorizer, 'url'),
        ('num', numeric_transformer, numeric_features),
        ('count', counter, ['url']),
        #('cat', categorical_transformer, categorical_features)
    ])

# check the format of the training data after preprocessing steps
# preprocessor.fit_transform(X_train)

define a whole piple to include model

pipeline = Pipeline(
    steps =[
        ('preprocessor', preprocessor),                            
        ('model',RandomForestRegressor(verbose=0, n_jobs=-1))
    ])

pipeline.fit(X_train,y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('url',
                                                  TfidfVectorizer(max_features=100,
                                                                  stop_words='english'),
                                                  'url'),
                                                 ('num',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  [' n_tokens_title',
                                                   ' n_tokens_content',
                                                   ' n_unique_tokens',
                                                   ' n_non_stop_words',
                                                   ' n_non_stop_unique_tokens',
                                                   ' num_hrefs',
                                                   ' num_self_hrefs',
                                                   ' n...
                                                   ' data_channel_is_tech',
                                                   ' data_channel_is_world',
                                                   ' kw_min_min', ' kw_max_min',
                                                   ' kw_avg_min', ' kw_min_max',
                                                   ' kw_max_max', ' kw_avg_max',
                                                   ' kw_min_avg', ' kw_max_avg',
                                                   ' kw_avg_avg',
                                                   ' self_reference_min_shares',
                                                   ' self_reference_max_shares',
                                                   ' '
                                                   'self_reference_avg_sharess',
                                                   ' weekday_is_monday', ...]),
                                                 ('count', Counter(),
                                                  ['url'])])),
                ('model', RandomForestRegressor(n_jobs=-1))])

plot the lift chart

def plot_predicted_vs_actuals(y_true, y_predicted, title = "Pred Vs Actual"):
    
    rmse= mean_squared_error(y_true= y_true, y_pred=y_predicted,squared=False)
    print("RMSE:", rmse)
   
    predVsActual = pd.DataFrame( {'y_true': y_true,'y_pred': y_predicted})
    predVsActual["pred_deciles"] = pd.qcut(predVsActual["y_pred"],q=10 ,labels=False, duplicates = 'drop')
   
    aggregations = {
        'y_true':'mean',
        'y_pred': 'mean'
    }
    counts = predVsActual.groupby("pred_deciles").count()["y_pred"]
 
    predVsActual= predVsActual.groupby("pred_deciles").agg(aggregations)
    predVsActual["counts"] = counts
    predVsActual[["y_true","y_pred"]].plot(title=title)
    plt.show()
    print(predVsActual)
    
    
test_preds = pipeline.predict(X_test)
plot_predicted_vs_actuals(y_test, test_preds)

RMSE: 11240.043547942365

lift plot

                   y_true        y_pred  counts
pred_deciles                                   
0             1410.976040   1226.474250     793
1             2334.482976   1623.177175     793
2             2066.532156   1958.356570     793
3             2311.552333   2279.694124     793
4             2859.727617   2645.796381     793
5             2926.430556   3046.564318     792
6             3554.585120   3543.738184     793
7             4165.051702   4249.105233     793
8             4935.030265   5450.838789     793
9             6709.374527  11123.519130     793