Today we discusses three ways to speed up operations in the Python data analysis library Pandas. Pandas is useful for working with tabular data stored in spreadsheets and databases. It provides many functions for manipulating and transforming dataframes, or structural data.
Method 1: using itertuples() to iterate over dataframe rows
explain: There are many places that we need to iterate rows of a dataframe and do some manipulations. Naively, we just check each row of data without thinking too much, here we use a simple sum operations to show the performance change
generate a dataframe from random numbers
import pandas as pd import random
df = pd.DataFrame( { "a": [random.randint(0,100) for _ inrange(100000)], "b": [random.randint(100,200) for _ inrange(100000)], } ) df
a
b
0
17
161
1
77
150
2
30
121
3
18
130
4
31
178
...
...
...
99995
5
183
99996
46
138
99997
9
133
99998
25
162
99999
98
144
100000 rows × 2 columns
situation 1: naive loop
%%timeit
results = [] for i inrange(len(df)): results.append(df.iloc[i]['a']+df.iloc[i]['b'])
11.9 s ± 297 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
situation 2: using iterrows()
%%timeit results = [] for index, row in df.iterrows(): results.append(row['a']+row['b'])
3.61 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
situation 3: using itertuples()
%%timeit results = [] for row in df.itertuples(): results.append(row.a+row.b)
69.4 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
conclusion: using itertuples() is 52 times faster than iterrows(), and 171 times than naive looping!
Method 2: how to create new dataframe efficiently
explain: when we need to make a new dataframe and add a new column to it, for example, a sum of two existing columns
situation 1: get data list, then make a new dataframe from the data
%%timeit
results = [] for row in df.itertuples(): results.append( (row.a, row.b, row.a+row.b) ) new_df = pd.DataFrame(data=results)
133 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
situation 2: copy the old dataframe to a new one, then make new column using apply() function
988 ms ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
situation 3: copy the old dataframe to a new one, then make new column using native dataframe vectorized operation
%%timeit new_df = df.copy()
new_df['c'] = new_df['a']+new_df['b']
956 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
conclusion: if we can leverage native df operations without for looping, it’s the best, could be 1000 times faster; if we have complex manipulation, making new data then create dataframe out of it, might be a better choice
Method 3: how to use apply() function efficiently
under the hood, the popular apply() function is a for loop with some overhead; when possible, we can leverage benefits of vectorized operations; we take a conditional multiplication as an example here
Reprint policy:
All articles in this blog are used except for special statements
CC BY 4.0
reprint policy. If reproduced, please indicate source
robot learner
!