Why ?

I have a pandas dataframe, with a lot of rows. I want to create a new column based on the other columns.

Tested Configuration:
MacOS: Sierra 10.12
Pandas: 0.23.3
Python: 3.0

Create the dataframe

We want simple 1 column dataframe with 1 million rows.

import pandas as pd, numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(1000000)), columns=['column_1'])

The BAD way

If you develop, you will intuitively use a row by row pattern, like this:

new_results = {}

for index, row in df.iterrows():
    row["column_2"] = 'high' if row["column_1"] > 5 else 'low' if row["column_1"] > 0 else 'null'
    new_results[index] = dict(row)

df = pd.DataFrame.from_dict(new_results, orient='index')

it works. but…

It’s so SLOW

The big drawback from this way of doing is the time it takes to execute the loop. Going through every single row takes a long time, simply because there’s a lot of rows. This solution is fine for smaller dataframes, but not here.

The GOOD way

conditions = [
    (df['column_1'] > 5),
    (df['column_1'] <= 5) & (df['column_1'] > 0),
    (df['column_1'] == 0)]
choices = ['high','low','null']

df['column_2'] = np.select(conditions, choices, default='null')

This is 100 times faster !


It’s musch faster form 2 reasons:


Original answer from stackoverflow

Create a dataframe

Dig into Pandas