Why?
I have a pandas dataframe with a lot of rows, and I want to create a new column based on the existing columns.
Tested Configuration:
MacOS: Sierra 10.12
Pandas: 0.23.3
Python: 3.0
Create the dataframe
We want a simple one-column dataframe with 1 million rows.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(low=0, high=10, size=1000000), columns=['column_1'])
The BAD way
If you come from ordinary Python development, you will intuitively reach for a row-by-row pattern like this:
new_results = {}
for index, row in df.iterrows():
    row["column_2"] = 'high' if row["column_1"] > 5 else 'low' if row["column_1"] > 0 else 'null'
    new_results[index] = dict(row)
df = pd.DataFrame.from_dict(new_results, orient='index')
It works, but…
It’s so SLOW
The big drawback of this approach is the time it takes to execute the loop: visiting every single row is slow simply because there are a lot of rows. This solution is fine for small dataframes, but not for a million rows.
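You can measure the cost yourself. This sketch times the same iterrows() loop on a smaller sample (10,000 rows instead of a million, so it finishes quickly); scale the size up to see how badly it grows.

```python
import time

import numpy as np
import pandas as pd

# Smaller sample so the timing finishes quickly; try 1_000_000 to feel the pain.
df = pd.DataFrame(np.random.randint(0, 10, size=10_000), columns=['column_1'])

start = time.perf_counter()
new_results = {}
for index, row in df.iterrows():
    # Same row-by-row classification as in the loop above.
    row['column_2'] = ('high' if row['column_1'] > 5
                       else 'low' if row['column_1'] > 0
                       else 'null')
    new_results[index] = dict(row)
out = pd.DataFrame.from_dict(new_results, orient='index')
elapsed = time.perf_counter() - start

print(f"iterrows on {len(df)} rows: {elapsed:.2f}s")
```

The cost is roughly linear in the number of rows, because iterrows() materialises a fresh Series object for every row.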
The GOOD way
conditions = [
(df['column_1'] > 5),
(df['column_1'] <= 5) & (df['column_1'] > 0),
(df['column_1'] == 0)]
choices = ['high','low','null']
df['column_2'] = np.select(conditions, choices, default='null')
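To convince yourself the vectorised version encodes the same rule as the loop, a quick sanity check (self-contained, re-creating the dataframe; the groupby at the end is just illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=1_000_000), columns=['column_1'])

conditions = [
    (df['column_1'] > 5),
    (df['column_1'] <= 5) & (df['column_1'] > 0),
    (df['column_1'] == 0),
]
choices = ['high', 'low', 'null']
df['column_2'] = np.select(conditions, choices, default='null')

# Spot-check the mapping: each label should cover the expected value range.
print(df.groupby('column_2')['column_1'].agg(['min', 'max']))
```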
This is roughly 100 times faster!
Why
It’s much faster for two reasons:
- Pandas is a vectorised library, so processing whole columns is far more optimised than going row by row
- NumPy is designed for large arrays, and its mathematical functions are highly optimised
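Both reasons come together in np.select: it evaluates every condition over the whole column at once, and for each element picks the choice of the first condition that is True, so the order of the conditions matters. A minimal sketch on a plain array:

```python
import numpy as np

x = np.array([7, 3, 0, 9])
conditions = [x > 5, x > 0]  # checked in order; first match wins
choices = ['high', 'low']
result = np.select(conditions, choices, default='null')
print(result)  # ['high' 'low' 'null' 'high']
```

Note that 7 and 9 satisfy both conditions but get 'high', because x > 5 comes first in the list.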
Resources
Original answer from Stack Overflow
Create a dataframe
Dig into Pandas