OneHot encoding of columns in a pandas dataframe using get_dummies.

The simplest way I have found to perform OneHot encoding is the pandas.get_dummies function. In one line of code you can OneHot encode selected columns of a pandas dataframe. It returns a new dataframe with the other columns intact, and it takes care of naming the new encoded columns too.

Let’s start by importing NumPy, pandas and creating a small example dataframe.

Python

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': [1,2,3,4,5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

print(df)

output

       col1 col2  col3
    0     1    A     1
    1     2    B     3
    2     3    B     2
    3     4    A     3
    4     5    A     2

Simple use

To OneHot encode the col2, do as follows:

Python

pd.get_dummies(df, columns=['col2'], prefix='col2')

output

       col1  col3  col2_A  col2_B
    0     1     1    True   False
    1     2     3   False    True
    2     3     2   False    True
    3     4     3    True   False
    4     5     2    True   False

The columns argument takes a list of column names in the dataframe.
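As a side note on the columns argument: when it is left out, pandas encodes only the columns with object, string, or category dtype, so numeric columns such as col3 pass through untouched. A minimal sketch, rebuilding the example dataframe so the block runs on its own (the name df_default is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

# No columns argument: only the string column col2 is encoded,
# the integer columns col1 and col3 are returned unchanged.
df_default = pd.get_dummies(df)
print(df_default.columns.tolist())
```

This is why col3 has to be listed explicitly in columns when we want its integer values treated as categories.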

The prefix argument takes a string or a list of prefixes to use for the names of the new columns that are created.

The col2 in our example dataframe has two unique values, A and B. The new columns are named as per the formula, prefix + '_' + unique_value.
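The naming formula can be checked by hand. The sketch below uses a prefix that differs from the column name (grp is an arbitrary, hypothetical prefix chosen for illustration) and rebuilds the expected names from the unique values:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

# 'grp' is a made-up prefix; any string works here.
encoded = pd.get_dummies(df, columns=['col2'], prefix='grp')

# Rebuild the expected names by hand: prefix + '_' + unique_value
expected = ['grp_' + value for value in sorted(df['col2'].unique())]

print(expected)
print([c for c in encoded.columns if c.startswith('grp_')])
```

Both print statements should list the same names, which shows that the prefix does not have to match the original column name.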

The following converts two columns to OneHot encoded form, passing a list as the prefix argument with one prefix per encoded column.

Python

df_onehot = pd.get_dummies(df, columns=['col2', 'col3'], prefix=['col2', 'col3'])
print(df_onehot)

output

       col1  col2_A  col2_B  col3_1  col3_2  col3_3
    0     1    True   False    True   False   False
    1     2   False    True   False   False    True
    2     3   False    True   False    True   False
    3     4    True   False   False   False    True
    4     5    True   False   False    True   False

As seen above, the values created by OneHot encoding are True or False. If we want them as 1 or 0 instead, we use the dtype argument.

Python

df_onehot = pd.get_dummies(df, 
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32)

print(df_onehot)

output

       col1  col2_A  col2_B  col3_1  col3_2  col3_3
    0     1       1       0       1       0       0
    1     2       0       1       0       0       1
    2     3       0       1       0       1       0
    3     4       1       0       0       0       1
    4     5       1       0       0       1       0

Now, the encoded values are 1 or 0.
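To confirm what dtype actually changes, we can inspect the dtypes of the new columns. The sketch below also shows an alternative that is not from the example above: casting an already encoded dataframe afterwards with astype. The names df_default, df_int, dummy_cols and df_cast are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_default = pd.get_dummies(df, columns=['col2', 'col3'])
df_int = pd.get_dummies(df, columns=['col2', 'col3'], dtype=np.int32)

# Every column except col1 was created by the encoding
dummy_cols = [c for c in df_int.columns if c != 'col1']
print(df_int[dummy_cols].dtypes)

# Alternative: cast only the dummy columns of the default result afterwards
df_cast = df_default.astype({c: np.int32 for c in dummy_cols})
print(df_cast.equals(df_int))
```

Passing dtype directly is shorter, while the astype route can be handy when the encoded dataframe already exists.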

`prefix_sep`

The default separator between the prefix and unique_value is an underscore, _. If needed, this can be changed using the prefix_sep argument.

Python

df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':'
              )

print(df_onehot)

output

       col1  col2:A  col2:B  col3:1  col3:2  col3:3
    0     1       1       0       1       0       0
    1     2       0       1       0       0       1
    2     3       0       1       0       1       0
    3     4       1       0       0       0       1
    4     5       1       0       0       1       0

Now the column names are in the form prefix + ':' + unique_value.
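A distinctive separator also makes it easier to go back from the encoded columns to the original categories. The sketch below is an addition to the walkthrough and assumes pandas 1.5 or later, where pd.from_dummies is available; the names dummy_cols and decoded are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'],
               prefix=['col2', 'col3'],
               dtype=np.int32,
               prefix_sep=':')

# Decode only the dummy columns; col1 is not a dummy and must be left out.
dummy_cols = [c for c in df_onehot.columns if ':' in c]
decoded = pd.from_dummies(df_onehot[dummy_cols], sep=':')

# The categories are recovered from the column names, so col3 comes back as strings.
print(decoded)
```

Note that the decoded col3 values are strings such as '1' rather than integers, because they are taken from the column names.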

`drop_first`

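If we want n-1 columns for the n unique values in a column, we use the drop_first argument. The first category gets no column of its own and is represented by zeros in all the remaining columns.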

Python

df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':',
               drop_first=True
              )

print(df_onehot)

output

       col1  col2:B  col3:2  col3:3
    0     1       0       0       0
    1     2       1       0       1
    2     3       1       1       0
    3     4       0       0       1
    4     5       0       1       0
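No information is lost by dropping the first category, which is why this option is often used to avoid a redundant column in linear models. A small sketch on col2 alone, with illustrative names df_full, df_dropped and is_a, checks that the dropped category A can still be recovered:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})

df_full = pd.get_dummies(df, columns=['col2'], dtype=int)
df_dropped = pd.get_dummies(df, columns=['col2'], dtype=int, drop_first=True)

# The first category (A) no longer has a column of its own...
print(df_dropped.columns.tolist())

# ...but it is still recoverable: A is exactly the rows where col2_B is 0.
is_a = df_dropped['col2_B'] == 0
print(is_a.equals(df_full['col2_A'] == 1))
```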

`dummy_na`

If we want a separate column for NaN values, we use the dummy_na argument. First, let's recreate the example dataframe with a missing value in col2.

Python

df = pd.DataFrame({
    'col1': [1,2,3,4,5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2]
})

print(df)

output

       col1 col2  col3
    0     1    A     1
    1     2    B     3
    2     3    B     2
    3     4  NaN     3
    4     5    A     2

Python

df_onehot = pd.get_dummies(df,
               columns=['col2', 'col3'], 
               prefix=['col2', 'col3'], 
               dtype=np.int32,
               prefix_sep=':',
               dummy_na=True
              )

print(df_onehot)

output

       col1  col2:A  col2:B  col2:nan  col3:1.0  col3:2.0  col3:3.0  col3:nan
    0     1       1       0         0         1         0         0         0
    1     2       0       1         0         0         0         1         0
    2     3       0       1         0         0         1         0         0
    3     4       0       0         1         0         0         1         0
    4     5       1       0         0         0         1         0         0
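In the output above, col3 also gains a col3:nan column (all zeros, since it has no missing values) and its categories are shown as 1.0, 2.0 and 3.0. To see what dummy_na changes for the row with the missing value, the sketch below encodes col2 alone with and without it; the names without_na and with_na are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2]
})

without_na = pd.get_dummies(df, columns=['col2'], dtype=int)
with_na = pd.get_dummies(df, columns=['col2'], dtype=int, dummy_na=True)

# Without dummy_na the missing row (index 3) is encoded as all zeros.
print(without_na.loc[3, ['col2_A', 'col2_B']].tolist())

# With dummy_na it gets its own indicator column, named col2_nan here.
print(with_na.loc[3, ['col2_A', 'col2_B', 'col2_nan']].tolist())
```

Without dummy_na, a missing value is indistinguishable from "none of the known categories", which may or may not be what we want downstream.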
