OneHot encoding of columns in a pandas dataframe using get_dummies.

The simplest way I found for performing OneHot encoding is the pandas.get_dummies function. In just one line of code you can OneHot encode selected columns in a pandas dataframe. It returns the dataframe with the other columns intact, and it also takes care of naming the new encoded columns.

Let’s start by importing NumPy and pandas and creating a small example dataframe.

Python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': [1,2,3,4,5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

print(df)
output
       col1 col2  col3
    0     1    A     1
    1     2    B     3
    2     3    B     2
    3     4    A     3
    4     5    A     2

Simple use

To OneHot encode col2, do as follows:

Python
pd.get_dummies(df, columns=['col2'], prefix='col2')
output
       col1  col3  col2_A  col2_B
    0     1     1    True   False
    1     2     3   False    True
    2     3     2   False    True
    3     4     3    True   False
    4     5     2    True   False

The columns argument takes a list of column names in the dataframe.

The prefix argument takes a string or a list of prefixes to use for the new column names.

In our example dataframe, col2 has two unique values, A and B. The new columns are named according to the formula prefix + '_' + unique_value.
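The naming rule can be checked directly. The following is a minimal sketch that rebuilds the expected names by hand and compares them with the columns get_dummies produced. The variable names encoded, expected and new_columns are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

encoded = pd.get_dummies(df, columns=['col2'], prefix='col2')

# Build the expected names by hand: prefix + '_' + unique_value
expected = ['col2' + '_' + value for value in sorted(df['col2'].unique())]

# Columns that exist only in the encoded dataframe
new_columns = [c for c in encoded.columns if c not in df.columns]

print(new_columns)
print(new_columns == expected)
```

Both lists should contain col2_A and col2_B, in that order.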

Here is how to convert two columns to OneHot encoded form, using a list as the prefix argument.

Python
df_onehot = pd.get_dummies(df, columns=['col2', 'col3'], prefix=['col2', 'col3'])
print(df_onehot)
output
       col1  col2_A  col2_B  col3_1  col3_2  col3_3
    0     1    True   False    True   False   False
    1     2   False    True   False   False    True
    2     3   False    True   False    True   False
    3     4    True   False   False   False    True
    4     5    True   False   False    True   False
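Besides a string or a list, the prefix argument also accepts a dict that maps each column name to its prefix, which avoids relying on the order of the list. A small sketch, where the prefixes letter and num are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

# prefix given as a dict: column name -> prefix for that column
# ('letter' and 'num' are hypothetical prefixes chosen for illustration)
df_named = pd.get_dummies(df,
                          columns=['col2', 'col3'],
                          prefix={'col2': 'letter', 'col3': 'num'})

print(list(df_named.columns))
```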

We see above that the values created by OneHot encoding are True or False (bool is the default dtype in pandas 2.0 and later; older versions return 0 and 1 as uint8). If we want them as 1 or 0 instead, we use the dtype argument.

Python
df_onehot = pd.get_dummies(df, 
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32)

print(df_onehot)
output
       col1  col2_A  col2_B  col3_1  col3_2  col3_3
    0     1       1       0       1       0       0
    1     2       0       1       0       0       1
    2     3       0       1       0       1       0
    3     4       1       0       0       0       1
    4     5       1       0       0       1       0

Now, the encoded values are 1 or 0.
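A quick way to convince ourselves that the result is a valid OneHot encoding is to check the dtypes and to confirm that each row has exactly one 1 within each group of encoded columns. This is only a sanity-check sketch; the names col2_sum and col3_sum are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32)

# The encoded columns have the requested dtype
print(df_onehot.dtypes)

# Within each encoded group, exactly one column is 1 in every row
col2_sum = df_onehot[['col2_A', 'col2_B']].sum(axis=1)
col3_sum = df_onehot[['col3_1', 'col3_2', 'col3_3']].sum(axis=1)
print((col2_sum == 1).all(), (col3_sum == 1).all())
```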

prefix_sep

The default separator between the prefix and unique_value is an underscore, _. If needed, this can be changed using the prefix_sep argument.

Python
df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':'
              )

print(df_onehot)
output
       col1  col2:A  col2:B  col3:1  col3:2  col3:3
    0     1       1       0       1       0       0
    1     2       0       1       0       0       1
    2     3       0       1       0       1       0
    3     4       1       0       0       0       1
    4     5       1       0       0       1       0

Now the column names are in the form prefix + ':' + unique_value.
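One possible reason to change the separator is to make the column names easy to split back into prefix and value, for instance when the prefixes or values themselves contain underscores. A minimal sketch, assuming ':' does not appear in any column name or category value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32,
                           prefix_sep=':')

# Split each encoded column name back into (prefix, unique_value);
# col1 has no separator, so it is skipped
pairs = [tuple(name.split(':', 1)) for name in df_onehot.columns if ':' in name]
print(pairs)
```

Note that the values come back as strings, so the col3 values appear as '1', '2' and '3'.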

drop_first

If we want n-1 columns for n unique values in a column, we use the drop_first argument. The first category of each encoded column is dropped.

Python
df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':',
               drop_first=True
              )

print(df_onehot)
output
       col1  col2:B  col3:2  col3:3
    0     1       0       0       0
    1     2       1       0       1
    2     3       1       1       0
    3     4       0       0       1
    4     5       0       1       0
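No information is lost by dropping the first category: a row with 0 in all remaining columns of a group belongs to the dropped category. This is also why drop_first is commonly used with linear models, where the full set of columns would be perfectly collinear. A small sketch that recovers col2 from the single remaining column; the name recovered is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_onehot = pd.get_dummies(df,
                           columns=['col2'],
                           prefix=['col2'],
                           dtype=np.int32,
                           prefix_sep=':',
                           drop_first=True)

# Only col2:B remains; a 0 there means the dropped category A
recovered = np.where(df_onehot['col2:B'] == 1, 'B', 'A')

print(list(df_onehot.columns))
print(list(recovered) == list(df['col2']))
```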

dummy_na

If we want a separate column for NaN values, we use the dummy_na argument.

Python
df = pd.DataFrame({
    'col1': [1,2,3,4,5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2]
})

print(df)
output
       col1 col2  col3
    0     1    A     1
    1     2    B     3
    2     3    B     2
    3     4  NaN     3
    4     5    A     2

Python
df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':',
               dummy_na=True
              )

print(df_onehot)
output
       col1  col2:A  col2:B  col2:nan  col3:1.0  col3:2.0  col3:3.0  col3:nan
    0     1       1       0         0         1         0         0         0
    1     2       0       1         0         0         0         1         0
    2     3       0       1         0         0         1         0         0
    3     4       0       0         1         0         0         1         0
    4     5       1       0         0         0         1         0         0
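To see what dummy_na changes, it helps to compare with the default behaviour, where a NaN row simply gets 0 in every encoded column of that group. The sketch below encodes only col2 to keep the comparison small; the names without_na and with_na are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2]
})

without_na = pd.get_dummies(df, columns=['col2'], prefix='col2',
                            dtype=np.int32, prefix_sep=':')
with_na = pd.get_dummies(df, columns=['col2'], prefix='col2',
                         dtype=np.int32, prefix_sep=':', dummy_na=True)

# Row 3 holds the NaN: without dummy_na every col2 column is 0 there
print(without_na.loc[3, ['col2:A', 'col2:B']].tolist())

# With dummy_na the extra col2:nan column marks that row
print(with_na.loc[3, ['col2:A', 'col2:B', 'col2:nan']].tolist())
```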