Handling missing or null values in a pandas DataFrame


Obtain null or missing values of a DataFrame

Suppose the DataFrame looks like this, with 10 rows and 6 columns:


0 1 2 3 4 5
0 0.520113 0.884000 1.260966 -0.236597 0.312972 -0.196281
1 -0.837552 NaN 0.143017 0.862355 0.346550 0.842952
2 -0.452595 NaN -0.420790 0.456215 1.203459 0.527425
3 0.317503 -0.917042 1.780938 -1.584102 0.432745 0.389797
4 -0.722852 1.704820 -0.113821 -1.466458 0.083002 0.011722
5 -0.622851 -0.251935 -1.498837 NaN 1.098323 0.273814
6 0.329585 0.075312 -0.690209 -3.807924 0.489317 -0.841368
7 -1.123433 -1.187496 1.868894 -2.046456 -0.949718 NaN
8 1.133880 -0.110447 0.050385 -1.158387 0.188222 NaN
9 -0.513741 1.196259 0.704537 0.982395 -0.585040 -1.693810

The isnull() function returns a DataFrame of booleans with the same shape, where True marks a missing value:

       0      1      2      3      4      5
0 False False False False False False
1 False True False False False False
2 False True False False False False
3 False False False False False False
4 False False False False False False
5 False False False True False False
6 False False False False False False
7 False False False False False True
8 False False False False False True
9 False False False False False False

The following command selects the rows that have any null values:

df[df.isnull().any(axis=1)]

The following command selects the columns that have any null values:

df[df.columns[df.isna().any()]]

The following command selects the rows that have a null value in a specific column, e.g., column 3:

df[df[3].isnull()]
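To try the three selections above on something small, here is a minimal sketch. The frame and its values are made up for illustration (they are not the 10-row frame shown earlier); only the integer column labels mirror it.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame with integer column labels and two missing values.
df = pd.DataFrame(
    {
        0: [0.5, -0.8, 0.3],
        1: [0.9, np.nan, -0.9],
        2: [1.3, 0.1, 1.8],
        3: [np.nan, 0.9, -1.6],
    }
)

# Rows that contain at least one null value.
rows_with_null = df[df.isnull().any(axis=1)]

# Columns that contain at least one null value (isna() is an alias of isnull()).
cols_with_null = df[df.columns[df.isna().any()]]

# Rows where column 3 specifically is null.
rows_null_in_3 = df[df[3].isnull()]

print(rows_with_null.index.tolist())    # [0, 1]
print(cols_with_null.columns.tolist())  # [1, 3]
print(rows_null_in_3.index.tolist())    # [0]
```

Each expression builds a boolean mask first and then uses it to index the frame, so the original df is left unchanged.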

Drop null values

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})

>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25

Fill missing values

Filling missing values using fillna(), replace() and interpolate()

To fill null values in a dataset, pandas provides the fillna(), replace() and interpolate() functions. Each of them replaces NaN values with other values. fillna() and replace() substitute values that you supply, whereas interpolate() estimates each missing value from its neighbors using an interpolation technique (linear by default) instead of a hard-coded value.
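Since replace() and interpolate() are mentioned here, a minimal sketch of both may help. It reuses the score data from the examples in this section; the variable names and the fill value -1 are only illustrative choices.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# interpolate() defaults to linear interpolation down each column,
# so a gap between 90 and 95 is filled with their midpoint.
interpolated = scores.interpolate()
print(interpolated.loc[2, 'First Score'])  # 92.5

# replace() swaps one value for another; here every NaN becomes -1
# (the -1 is an arbitrary placeholder chosen for this sketch).
replaced = scores.replace(np.nan, -1)
print(replaced.loc[2, 'First Score'])  # -1.0
```

Both calls return a new DataFrame and leave scores untouched.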

Code #1: Filling null values with a single value


import numpy as np
import pandas as pd

# dictionary of lists (named `data` so it does not shadow the built-in dict)
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# fill every missing value with 0; fillna() returns a new DataFrame
df.fillna(0)

Code #2: Filling null values with the previous ones


import numpy as np
import pandas as pd

# dictionary of lists (named `data` so it does not shadow the built-in dict)
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# fill each missing value with the previous valid value in its column;
# ffill() is equivalent to fillna(method='pad'), which recent pandas
# releases deprecate
df.ffill()

Code #3: Filling null values with the next ones


import numpy as np
import pandas as pd

# dictionary of lists (named `data` so it does not shadow the built-in dict)
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# fill each missing value with the next valid value in its column;
# bfill() is equivalent to fillna(method='bfill'), which recent pandas
# releases deprecate
df.bfill()
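One detail worth checking when filling from neighbors is what happens at the edges of a column. The sketch below uses the same score data and the ffill() and bfill() methods; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

forward = scores.ffill()   # copy the previous valid value downward
backward = scores.bfill()  # copy the next valid value upward

# The interior gap in 'First Score' gets a different value each way.
print(forward.loc[2, 'First Score'])   # 90.0
print(backward.loc[2, 'First Score'])  # 95.0

# A leading NaN has no previous value, and a trailing NaN has no next
# value, so each method leaves one cell unfilled in this data.
print(forward.isnull().sum().sum())   # 1
print(backward.isnull().sum().sum())  # 1

# Chaining both directions fills every cell here.
both = scores.ffill().bfill()
print(both.isnull().sum().sum())  # 0
```

Whether chaining the two is appropriate depends on the data, since it copies values across the start or end of a column.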

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source, robot learner.