This is a short summary of the most common functions to use when starting to analyse a dataset with Python and pandas. It is written from a data scientist's point of view and was created following a Kaggle challenge.

1. Requirements

For data science with Python, we need 2 basic libraries: NumPy and pandas

import pandas as pd
import numpy as np

2. Look at your data

Load your dataset

my_data = pd.read_csv("../my_dataset.csv")

Look at random rows

# Set the seed first so the random numbers are reproducible (0 is an arbitrary choice).
# This way, re-running the notebook gives the same sequence of `my_data.sample(10)` results.
np.random.seed(0)
my_data.sample(10)
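As an alternative to seeding NumPy globally, `sample` also accepts a `random_state` argument. The sketch below uses a small made-up DataFrame in place of `my_data`; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy DataFrame standing in for `my_data` (hypothetical values).
my_data = pd.DataFrame({"a": range(100), "b": range(100, 200)})

# Passing `random_state` makes this particular call reproducible,
# without touching NumPy's global random state.
first_sample = my_data.sample(10, random_state=0)
second_sample = my_data.sample(10, random_state=0)

# Both calls return exactly the same rows.
print(first_sample.equals(second_sample))
```

This keeps the reproducibility local to the call, which can be handy when other code in the same notebook also draws random numbers.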


If you prefer, you can also work on a subset of your data, for instance the first 8 columns (the `.head()` below additionally keeps only the first 5 rows, for a quick look)

subset_my_data = my_data.loc[:, 'column1':'column8'].head()
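If the column names are not known in advance, the same kind of subset can be taken by position with `iloc`. This is a minimal sketch on a made-up DataFrame; the names `column1` to `column10` are assumptions chosen to match the line above.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with 10 columns named column1..column10 (hypothetical names).
my_data = pd.DataFrame(
    np.arange(30).reshape(3, 10),
    columns=[f"column{i}" for i in range(1, 11)],
)

# Select all rows and the first 8 columns by position.
first_eight = my_data.iloc[:, :8]

# Label-based slicing with .loc includes both end labels,
# so this selects the same 8 columns.
same_by_label = my_data.loc[:, "column1":"column8"]

print(first_eight.shape)
```

Note the difference between the two: `iloc` excludes the end position (like regular Python slicing), while `loc` includes the end label.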

3. Missing data points

There are almost always missing values. How many are there in each column?

missing_values_count = my_data.isnull().sum()

# or as a percentage of all cells
# (np.prod replaces np.product, which was removed in NumPy 2.0)
total_cells = np.prod(my_data.shape)
total_missing = missing_values_count.sum()
(total_missing/total_cells) * 100
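The overall percentage hides which columns are the problem. A per-column percentage can be sketched as follows, on a small made-up DataFrame (column names and values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for `my_data` (hypothetical values).
my_data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", "w"],
})

# isnull() gives True/False per cell; the mean of booleans
# is the fraction of True, i.e. the fraction of missing values.
missing_percent = my_data.isnull().mean() * 100

# Show the worst columns first.
print(missing_percent.sort_values(ascending=False))
```

Here column `b` is 75% missing, `a` is 25% missing and `c` has no missing values.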

4. Think or Drop

Here you have 2 options.

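  1. No time: drop the missing values
  2. Think about why the data is missing: was it not recorded, or does it not exist?

1. DROP null values

# No time to investigate, so we drop every row that has at least one missing value
my_data.dropna()

If we end up with nothing, it may be smarter to drop columns instead. Note that `dropna(axis=1)` removes every column that has at least one missing value, not only the “almost empty” ones: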

columns_with_na_dropped = my_data.dropna(axis=1)
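To remove only the “almost empty” columns and keep the rest, `dropna` accepts a `thresh` argument: the minimum number of non-missing values a column needs in order to be kept. A minimal sketch, assuming a made-up DataFrame and an arbitrary 50% cut-off:

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for `my_data` (hypothetical values).
my_data = pd.DataFrame({
    "mostly_full": [1.0, 2.0, np.nan, 4.0],
    "almost_empty": [np.nan, np.nan, np.nan, 1.0],
    "full": [1, 2, 3, 4],
})

# Keep only the columns with at least 50% non-missing values.
# The 50% cut-off is an arbitrary choice for this example.
min_non_missing = int(0.5 * len(my_data))
columns_kept = my_data.dropna(axis=1, thresh=min_non_missing)

# Now dropping incomplete rows no longer wipes out the whole dataset.
rows_kept = columns_kept.dropna()

print(list(columns_kept.columns))
print(len(rows_kept))
```

Only `almost_empty` is removed here; dropping the remaining incomplete rows then leaves 3 of the 4 rows.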

To see how many columns `dropna(axis=1)` removed:

print(f"Columns before: {my_data.shape[1]} - Columns after: {columns_with_na_dropped.shape[1]}")

2. THINK: fill null values (“imputation”)

# fill with 0
my_data.fillna(0)

# fill with the next valid value in the same column
my_data_partially_filled = my_data.bfill(axis=0)
# fill the remaining missing values with 0 (fillna returns a new DataFrame, so keep the result)
my_data_filled = my_data_partially_filled.fillna(0)
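Why a second fill is needed after the backward fill: missing values at the end of a column have no "next value" to copy, so they stay missing. The sketch below shows this on a made-up DataFrame (values are assumptions), using `DataFrame.bfill`, which does the same job as `fillna(method='bfill')`.

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for `my_data` (hypothetical values).
my_data = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan, np.nan],
    "b": [1.0, np.nan, 3.0, 4.0],
})

# Backward fill: each missing value takes the next valid value below it.
# The last two rows of "a" have nothing below them, so they stay NaN.
backfilled = my_data.bfill(axis=0)

# Fill whatever is still missing with 0.
filled = backfilled.fillna(0)

print(backfilled)
print(filled)
```

After the two steps, column `a` becomes 2, 2, 0, 0 and column `b` becomes 1, 3, 3, 4.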


Kaggle original challenge: notebook
More on imputation