Ways to make your Pandas operations 100 times faster


Today we discuss three ways to speed up operations in Pandas, the Python data analysis library. Pandas is useful for working with tabular data stored in spreadsheets and databases, and it provides many functions for manipulating and transforming dataframes, i.e. structured data.

Method 1: using itertuples() to iterate over dataframe rows

explain: There are many situations where we need to iterate over the rows of a dataframe and manipulate each one. Naively, we just loop over each row without thinking too much. Here we use a simple sum operation to show the performance difference.

generate a dataframe from random numbers


import pandas as pd
import random

df = pd.DataFrame(
    {
        "a": [random.randint(0, 100) for _ in range(100000)],
        "b": [random.randint(100, 200) for _ in range(100000)],
    }
)
df

a b
0 17 161
1 77 150
2 30 121
3 18 130
4 31 178
... ... ...
99995 5 183
99996 46 138
99997 9 133
99998 25 162
99999 98 144

100000 rows × 2 columns

situation 1: naive loop

%%timeit

results = []
for i in range(len(df)):
    results.append(df.iloc[i]['a'] + df.iloc[i]['b'])
11.9 s ± 297 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

situation 2: using iterrows()

%%timeit
results = []
for index, row in df.iterrows():
    results.append(row['a'] + row['b'])

3.61 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

situation 3: using itertuples()

%%timeit
results = []
for row in df.itertuples():
    results.append(row.a + row.b)
69.4 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

conclusion: using itertuples() is 52 times faster than iterrows(), and 171 times faster than the naive loop!
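One caveat worth knowing: itertuples() returns namedtuples, so columns whose names are not valid Python identifiers (spaces, duplicates) get positional names like `_1`. Passing `index=False, name=None` (standard Pandas parameters) sidesteps this by yielding plain tuples, which also skips the namedtuple construction overhead. A minimal sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Plain tuples: unpack by position instead of accessing by attribute name.
# This works even when column names are not valid identifiers.
results = [a + b for a, b in df.itertuples(index=False, name=None)]
print(results)  # [11, 22, 33]
```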

Method 2: how to create new dataframe efficiently

explain: sometimes we need to make a new dataframe and add a new column to it, for example the sum of two existing columns.

situation 1: get data list, then make a new dataframe from the data

%%timeit

results = []
for row in df.itertuples():
    results.append((row.a, row.b, row.a + row.b))

new_df = pd.DataFrame(data=results)
133 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

situation 2: copy the old dataframe to a new one, then make the new column using the apply() function

%%timeit
new_df = df.copy()

new_df['c'] = new_df.apply(lambda row: row['a'] + row['b'], axis=1)
988 ms ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

situation 3: copy the old dataframe to a new one, then make the new column using a native vectorized dataframe operation

%%timeit
new_df = df.copy()

new_df['c'] = new_df['a'] + new_df['b']
956 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

conclusion: if we can leverage native dataframe operations without a for loop, that is the best option and can be 1000 times faster; if the manipulation is too complex for that, building the data first and then creating the dataframe from it may be the better choice.

Method 3: how to use apply() function efficiently

under the hood, the popular apply() function is essentially a for loop with some extra overhead; when possible, we should leverage vectorized operations instead. We take a conditional multiplication as an example here.

situation 1: use apply() directly

%%timeit
def condition_multi(a):
    if a > 30:
        return a * 2
    else:
        return a * 3


df['c'] = df['a'].apply(condition_multi)

33.4 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

situation 2: leverage NumPy operations

import numpy as np
%%timeit

df['c'] = np.where(df['a'] > 30, df['a'] * 2, df['a'] * 3)

1.18 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

conclusion: using NumPy makes it 28 times faster!
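When the logic has more than two branches, nested `np.where` calls get hard to read; `np.select` (a standard NumPy function) keeps the same vectorized speed while listing the conditions in order, with the first match winning. A minimal sketch on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 40, 80]})

# Conditions are checked in order; `default` covers rows matching none.
conditions = [df["a"] > 60, df["a"] > 30]
choices = [df["a"] * 1, df["a"] * 2]
df["c"] = np.select(conditions, choices, default=df["a"] * 3)
print(df["c"].tolist())  # [30, 80, 80]
```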

https://github.com/robotlearner001/blog/tree/main/Make-pandas-100-times-faster


Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please indicate the source: robot learner.