This is a short summary of the most common funciton when starting to analyse a dataset with python and pandas. This is a very datascientist function summary. This summary was created following the kaggle

1. Requirements

For Datascience using python, we need 2 basic libraries : numpy and pandas

import pandas as pd
import numpy as np

2. Look at your data

Load your dataset

my_data = pd.read_csv("../my_dataset.csv")

Look at random rows

# set "seed" first. Allows to reproduce the same random numbers
# This way, running `df.sample(10)` over and over you will get the same sequence
np.random.seed(0)

my_data.sample(2)

If you prefer, you can also work on a subset of your data, for instance the 8 first columns

subset_my_data = my_data.loc[:, 'column1':'column8'].head()

3. Missing data points

There is always missing values. How many in each column ?

missing_values_count = my_data.isnull().sum()
missing_values_count[0:10]

# or in percentage
total_cells = np.product(my_data.shape)
total_missing = missing_values_count.sum()
(total_missing/total_cells) * 100

4. Think or Drop

Here you have 2 options.

  1. no time : drop it
  2. think why the data is missing => wasn’t recorded or doesn’t exist?
1. DROP null values
# No time to investigate, so we drop every row where one data is missing
my_data.dropna()

If we end up with nothing, it may be smarter to remove the “almost empty” columns first

columns_with_na_dropped = my_data.dropna(axis=1)

Note that to know how many columns you cut, you can do

print("Columns before: %d" % my_data.shape[1], " - Columns after: %d" % columns_with_na_dropped.shape[1])
2. THINK : Fill null values “imputation
# fill with 0
my_data.fillna(0)

# fill with next value in the same column
my_data_partially_filled = my_data.fillna(method = 'bfill', axis=0)
my_data_partially_filled.fillna(0) # fill remaining one with 0

Ressources:

Kaggle original challenge: notebook
More on imputation