Hyperparameter tuning using Optuna with a RandomForestClassifier example (Python code)

data science

Publish Date: 2022-07-02

For many popular machine learning algorithms, the choice of hyperparameters can greatly affect model performance.

One naive way is to loop through every combination in the hyperparameter space and choose the best configuration. This strategy is called grid search. However, it can be very slow, because the number of combinations multiplies with every hyperparameter added.
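
To make the cost concrete, here is a minimal sketch of a grid search in plain Python. The grid values and the `toy_loss` function are hypothetical stand-ins for a real search space and a real cross-validation run; they are not part of the example that follows.

```python
from itertools import product

# Hypothetical search space: every combination of these values is tried.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, 15],
    "min_samples_split": [10, 20, 30],
}

def toy_loss(n_estimators, max_depth, min_samples_split):
    """Stand-in for an expensive cross-validation run (lower is better)."""
    return (
        (n_estimators - 300) ** 2 / 1e4
        + (max_depth - 10) ** 2
        + (min_samples_split - 20) ** 2 / 10
    )

def grid_search(grid, loss_fn):
    """Evaluate loss_fn on every combination in grid and return the best one."""
    names = list(grid)
    best_params, best_loss, n_evaluated = None, float("inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = loss_fn(**params)
        n_evaluated += 1
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss, n_evaluated

best_params, best_loss, n_evaluated = grid_search(param_grid, toy_loss)
print(n_evaluated, best_params, best_loss)
```

With three values for each of three hyperparameters, the loop already needs 27 evaluations, and each extra hyperparameter multiplies that count again.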

A better way is to use an optimization method that chooses which configurations to try next based on the results of earlier trials. Tools such as Optuna and Hyperopt fill this role.

In the following, we use Optuna as an example and apply it to a random forest classifier.

1. Import libraries and get the newsgroup data

import numpy as np
import joblib
import optuna

from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# download the 20 newsgroups training set (cached locally after the first call)
data = fetch_20newsgroups()

# keep the first 5000 documents and labels to limit the run time
X = data['data'][:5000]
y = data['target'][:5000]

2. Define a machine learning pipeline with TfidfVectorizer and RandomForestClassifier

model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),   
    ('rf', RandomForestClassifier())
])

3. Define the hyperparameter space and the Optuna objective to optimize

def objective(trial):
    # checkpoint the study at the start of every trial; `study` is the
    # global object created in step 4, before optimize() calls this function
    joblib.dump(study, 'study.pkl')

    # sample one value per hyperparameter; the names follow the pipeline's
    # '<step>__<parameter>' convention so they can be passed to set_params()
    tfidf__analyzer = trial.suggest_categorical('tfidf__analyzer', ['word', 'char', 'char_wb'])
    tfidf__lowercase = trial.suggest_categorical('tfidf__lowercase', [False, True])
    tfidf__max_features = trial.suggest_int('tfidf__max_features', 500, 10_000)
    rf__n_estimators = trial.suggest_int('rf__n_estimators', 300, 500)
    rf__max_depth = trial.suggest_int('rf__max_depth', 5, 15)
    rf__min_samples_split = trial.suggest_int('rf__min_samples_split', 10, 30)

    params = {
        'tfidf__analyzer': tfidf__analyzer,
        'tfidf__lowercase': tfidf__lowercase,
        'tfidf__max_features': tfidf__max_features,
        'rf__n_estimators': rf__n_estimators,
        'rf__max_depth': rf__max_depth,
        'rf__min_samples_split': rf__min_samples_split,
    }

    model.set_params(**params)

    # 'neg_log_loss' scores are negated log losses (higher is better), so
    # flip the sign and return a positive log loss for Optuna to minimize
    return -np.mean(cross_val_score(model, X, y, cv=3, n_jobs=-1, scoring='neg_log_loss'))

Notice that Optuna minimizes the objective function by default. Because scikit-learn's 'neg_log_loss' scorer returns the negative log loss (so that a higher score is better), we add another negative sign in front of the mean cross-validation score and let Optuna minimize the resulting positive log loss.
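
The sign handling can be checked with a small hand-rolled calculation. The labels and predicted probabilities below are made up for illustration, and `log_loss` here is a plain-Python stand-in for the metric, not scikit-learn's implementation.

```python
import math

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the true class (lower is better)."""
    total = sum(math.log(probs[label]) for label, probs in zip(y_true, y_prob))
    return -total / len(y_true)

# Hypothetical predictions for three samples over two classes.
y_true = [0, 1, 1]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

loss = log_loss(y_true, y_prob)  # never negative
score = -loss                    # the sign convention of scoring='neg_log_loss'
objective_value = -score         # what the objective hands back to Optuna

print(round(loss, 4), round(score, 4), round(objective_value, 4))
```

The two negations cancel, so the value Optuna minimizes is the ordinary log loss.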

4. Run the Optuna trials to find the best hyperparameter configuration

# by default the direction is 'minimize'; it can be set to 'maximize' too
#study = optuna.create_study(direction='minimize')
study = optuna.create_study()

# stop after a fixed number of trials; alternatively, stop after a time budget
#study.optimize(objective, timeout=3600)
study.optimize(objective, n_trials=20)

# save once more so that the file also contains the final trial
joblib.dump(study, 'study.pkl')

Notice that we save the state of the hyperparameter optimization to a local pickle file, which means we can monitor progress while the search is running, or inspect the result at the end, by opening another notebook.

5. Visualize the results

%matplotlib inline
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import optuna

# load the study saved by the tuning notebook
study = joblib.load('study.pkl')

# one row per trial; drop rows with missing fields (for example, unfinished trials)
df = study.trials_dataframe()
df.dropna(inplace=True)
df.reset_index(inplace=True)

# duration of each trial in seconds
df['time'] = (df.datetime_complete - df.datetime_start).dt.total_seconds()
df = df[df.time >= 0]

print('best val:', round(df.value.min(), 4))
print('best params:', study.best_params)

# blue line: best log loss found so far; red dots: log loss of each trial
a = sns.lineplot(x=df.index, y=df.value.cummin())
a.set_xlabel('trial number')
sns.scatterplot(x=df.index, y=df.value, color='red')
a.set_ylabel('log loss')
a.legend(['best value', "trial's value"]);