## Why?

I have a pandas dataframe with a lot of rows, and I want to **create a new column based on the other columns**.

Tested Configuration:

`MacOS: Sierra 10.12`

`Pandas: 0.23.3`

`Python: 3.0`

## Create the dataframe

We want a simple **1-column** dataframe with **1 million rows**.

```
import pandas as pd, numpy as np

# 1 million random integers between 0 and 9
df = pd.DataFrame(np.random.randint(low=0, high=10, size=1000000), columns=['column_1'])
```
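
Just to make sure the setup worked, a quick optional sanity check of the shape and the first few values (not required for the rest of the article):

```
# Optional sanity check: shape and a few sample values
print(df.shape)              # (1000000, 1)
print(df['column_1'].head())
```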

## The BAD way

As a developer, you will intuitively reach for a **row-by-row** pattern, like this:

```
new_results = {}
for index, row in df.iterrows():
    # row is a copy, so adding the new value does not touch df directly
    row["column_2"] = 'high' if row["column_1"] > 5 else 'low' if row["column_1"] > 0 else 'null'
    new_results[index] = dict(row)
df = pd.DataFrame.from_dict(new_results, orient='index')
```

It works, but…

### It’s so SLOW

The big drawback of this approach is the time it takes to execute the loop. Going through every single row takes a long time, simply because there are a lot of rows. This solution is fine for smaller dataframes, but not here.
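
To see this for yourself, here is a minimal timing sketch, assuming `df` from above (the exact duration depends on your machine, but expect it to be painfully long for 1 million rows):

```
import time

start = time.perf_counter()
new_results = {}
for index, row in df.iterrows():
    row["column_2"] = 'high' if row["column_1"] > 5 else 'low' if row["column_1"] > 0 else 'null'
    new_results[index] = dict(row)
# Store in a separate variable so the original df is kept for the next example
df_loop = pd.DataFrame.from_dict(new_results, orient='index')
print(f"row-by-row: {time.perf_counter() - start:.1f} s")
```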

## The GOOD way

```
# Each condition maps to the choice at the same position
conditions = [
    (df['column_1'] > 5),
    (df['column_1'] <= 5) & (df['column_1'] > 0),
    (df['column_1'] == 0)]
choices = ['high', 'low', 'null']
df['column_2'] = np.select(conditions, choices, default='null')
```

This is roughly 100 times faster!
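
The same timing pattern applied to the vectorised version gives the comparison (numbers are machine-dependent, so treat the 100x figure as an order of magnitude rather than a guarantee):

```
import time

start = time.perf_counter()
conditions = [
    (df['column_1'] > 5),
    (df['column_1'] <= 5) & (df['column_1'] > 0),
    (df['column_1'] == 0)]
choices = ['high', 'low', 'null']
df['column_2'] = np.select(conditions, choices, default='null')
print(f"np.select: {time.perf_counter() - start:.3f} s")
```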

### Why is it faster?

It’s much faster for two reasons:

- Pandas is a vectorised library, so column-wise processing is far more optimised than row-by-row processing (an equivalent `np.where` version is sketched just after this list)
- NumPy is designed for large arrays and matrices, and its mathematical functions are highly optimised
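
To illustrate the same principle, here is an equivalent column-wise version using nested `np.where` (my own alternative, not part of the original answer), which evaluates the whole column at once instead of one row at a time:

```
# Same result as np.select, expressed as nested np.where calls:
# the comparisons and the selection run over the entire column in one pass
df['column_2'] = np.where(df['column_1'] > 5, 'high',
                 np.where(df['column_1'] > 0, 'low', 'null'))
```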

## Resources

- Original answer on Stack Overflow
- Create a dataframe
- Dig into Pandas