The simplest way I have found to perform OneHot encoding is the pandas.get_dummies function. In a single line of code you can OneHot encode selected columns of a pandas dataframe. It returns the dataframe with the other columns intact, and it also takes care of naming the new encoded columns.
Let’s start by importing NumPy, pandas and creating a small example dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2]
})
print(df)
   col1 col2  col3
0     1    A     1
1     2    B     3
2     3    B     2
3     4    A     3
4     5    A     2
Simple use
To OneHot encode col2, do as follows:
pd.get_dummies(df, columns=['col2'], prefix='col2')
   col1  col3  col2_A  col2_B
0     1     1    True   False
1     2     3   False    True
2     3     2   False    True
3     4     3    True   False
4     5     2    True   False
The columns argument takes a list of column names in the dataframe.
The prefix argument takes a string or a list of prefixes to use for the names of the new columns.
The col2 in our example dataframe has two unique values, A and B. The new columns are named as per the formula, prefix + '_' + unique_value.
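As a quick check of that naming rule, the sketch below rebuilds the expected names by hand and compares them with what get_dummies produces. The variable names (expected_names, new_columns) and the alternative prefix 'letter' are only illustrative and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2],
})

encoded = pd.get_dummies(df, columns=['col2'], prefix='col2')

# Rebuild the names by hand: prefix + '_' + unique_value
prefix = 'col2'
expected_names = [prefix + '_' + str(value) for value in sorted(df['col2'].unique())]

# The new columns are the ones that were not in the original dataframe
new_columns = [c for c in encoded.columns if c not in df.columns]
print(new_columns)                    # ['col2_A', 'col2_B']
print(new_columns == expected_names)  # True

# The prefix does not have to match the source column name ('letter' is illustrative)
renamed = pd.get_dummies(df, columns=['col2'], prefix='letter')
print(list(renamed.columns))          # ['col1', 'col3', 'letter_A', 'letter_B']
```

If prefix is left out, get_dummies falls back to the source column name, so prefix='col2' above mainly makes the naming explicit.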
The following converts two columns to OneHot encoded form, using a list as the prefix argument.
df_onehot = pd.get_dummies(df, columns=['col2', 'col3'], prefix=['col2', 'col3'])
print(df_onehot)
   col1  col2_A  col2_B  col3_1  col3_2  col3_3
0     1    True   False    True   False   False
1     2   False    True   False   False    True
2     3   False    True   False    True   False
3     4    True   False   False   False    True
4     5    True   False   False    True   False
We see above that the values created by OneHot encoding are True or False. If we want them as 1 or 0 instead, we use the dtype argument.
df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32)
print(df_onehot)
   col1  col2_A  col2_B  col3_1  col3_2  col3_3
0     1       1       0       1       0       0
1     2       0       1       0       0       1
2     3       0       1       0       1       0
3     4       1       0       0       0       1
4     5       1       0       0       1       0
Now the encoded values are 1 or 0.
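To confirm what the dtype argument changes, one option is to inspect the column dtypes directly. This is a minimal sketch: the names encoded_default and encoded_int are illustrative, and the default dtype it prints depends on the installed pandas version (bool from pandas 2.0 onward, uint8 before that).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2],
})

# Without dtype, the encoded columns use the pandas default for this version
encoded_default = pd.get_dummies(df, columns=['col2'])
print(encoded_default['col2_A'].dtype)

# With dtype=np.int32, the encoded columns hold integers 1 and 0
encoded_int = pd.get_dummies(df, columns=['col2'], dtype=np.int32)
print(encoded_int['col2_A'].dtype)     # int32
print(encoded_int['col2_A'].tolist())  # [1, 0, 0, 1, 1]
```

Only the newly created columns take the requested dtype; col1 and col3 keep the dtype they already had.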
prefix_sep
The default separator between the prefix and the unique_value is an underscore, _. If needed, this can be changed using the prefix_sep argument.
df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32,
                           prefix_sep=':'
                           )
print(df_onehot)
   col1  col2:A  col2:B  col3:1  col3:2  col3:3
0     1       1       0       1       0       0
1     2       0       1       0       0       1
2     3       0       1       0       1       0
3     4       1       0       0       0       1
4     5       1       0       0       1       0
Now the column names are in the form prefix + ':' + unique_value.
drop_first
If we want n-1 columns for the n unique values in a column, we use the drop_first argument.
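The reason n-1 columns are enough is that the dropped category is implied whenever all the remaining columns are 0. The sketch below illustrates this on col3 alone; the names full, reduced and recovered_first are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', 'A', 'A'],
    'col3': [1, 3, 2, 3, 2],
})

# Encode col3 with all three columns, and again with the first one dropped
full = pd.get_dummies(df['col3'], prefix='col3', dtype=int)
reduced = pd.get_dummies(df['col3'], prefix='col3', dtype=int, drop_first=True)

print(list(full.columns))     # ['col3_1', 'col3_2', 'col3_3']
print(list(reduced.columns))  # ['col3_2', 'col3_3']

# The dropped column is 1 exactly where every remaining column is 0
recovered_first = 1 - reduced.sum(axis=1)
print((recovered_first == full['col3_1']).all())  # True
```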
df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32,
                           prefix_sep=':',
                           drop_first=True
                           )
print(df_onehot)
   col1  col2:B  col3:2  col3:3
0     1       0       0       0
1     2       1       0       1
2     3       1       1       0
3     4       0       0       1
4     5       0       1       0
dummy_na
If we want a separate column for NaN values, we use the dummy_na argument.
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2]
})
print(df)
   col1 col2  col3
0     1    A     1
1     2    B     3
2     3    B     2
3     4  NaN     3
4     5    A     2
df_onehot = pd.get_dummies(df,
                           columns=['col2', 'col3'],
                           prefix=['col2', 'col3'],
                           dtype=np.int32,
                           prefix_sep=':',
                           dummy_na=True
                           )
print(df_onehot)
   col1  col2:A  col2:B  col2:nan  col3:1.0  col3:2.0  col3:3.0  col3:nan
0     1       1       0         0         1         0         0         0
1     2       0       1         0         0         0         1         0
2     3       0       1         0         0         1         0         0
3     4       0       0         1         0         0         1         0
4     5       1       0         0         0         1         0         0
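For comparison, it may help to see what happens to the NaN row when dummy_na is left at its default of False. In this minimal sketch (the names without_na and with_na are illustrative), the row with the missing value is encoded as all zeros unless dummy_na=True is passed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['A', 'B', 'B', np.nan, 'A'],
    'col3': [1, 3, 2, 3, 2],
})

without_na = pd.get_dummies(df, columns=['col2'], dtype=int)
with_na = pd.get_dummies(df, columns=['col2'], dtype=int, dummy_na=True)

# Row 3 holds the NaN: by default it gets 0 in every encoded column
print(without_na.loc[3, ['col2_A', 'col2_B']].tolist())          # [0, 0]

# With dummy_na=True the missing value gets its own indicator column
print(with_na.loc[3, ['col2_A', 'col2_B', 'col2_nan']].tolist())  # [0, 0, 1]
```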