Tutorial: (Robust) One Hot Encoding in Python

Kevin Lemagnen · Cambridge Spark · Sep 10, 2018

One hot encoding is a common technique used to work with categorical features. There are multiple tools available to facilitate this pre-processing step in Python, but it usually becomes much harder when you need your code to work on new data that might have missing or additional values.

That’s the case when you deploy a model to production, for instance: often you don’t know what new values will appear in the data you receive.

In this tutorial we will present two ways of dealing with this problem. In both cases, we will first run one hot encoding on our training set and save a few attributes that we can reuse later on, when we need to process new data.

If you deploy a model to production, the best way of saving those values is to write your own class and define them as attributes that are set at training time, as internal state.

If you’re working in a notebook, it’s fine to save them as simple variables.

Let’s create a new dataset

Let’s make up a dataset containing journeys that happened in different cities in the UK, using different ways of transportation.

We’ll create a new DataFrame that contains two categorical features, city and transport, as well as a numerical feature duration for the duration of the journey in minutes.

import pandas as pd

df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])

Now let’s create our ‘unseen’ test data. To make it difficult, we will simulate the case where the test data has different values for the categorical features.

df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]],
                       columns=["city", "transport", "duration"])

Here our column city does not have the value London but has the new value Manchester. Our column transport has no value bus but the new value bike. Let’s see how we can build one hot encoded features for those datasets!

We’ll show two different methods: one using the get_dummies function from pandas, and the other using the OneHotEncoder class from sklearn.

Using pandas’ get_dummies

Process our training data

First we define the list of categorical features that we will want to process:

cat_columns = ["city", "transport"]

We can really quickly build dummy features with pandas by calling the get_dummies function. Let's create a new DataFrame for our processed data:

df_processed = pd.get_dummies(df, prefix_sep="__",
                              columns=cat_columns)
[Image: our df_processed DataFrame]
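
If you are following along without the original images, you can inspect the result directly (expected output shown as a comment):

print(df_processed.columns.tolist())
# ['duration', 'city__Cambridge', 'city__Liverpool', 'city__London',
#  'transport__bus', 'transport__car']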

That’s it for the training set part, now you have a DataFrame with one hot encoded features. We will need to save a few things into variables to make sure that we build the exact same columns on the test dataset.

See how pandas created the new columns in the format <column>__<value>. Let’s build a list that collects those new columns and store it in a new variable, cat_dummies.

cat_dummies = [col for col in df_processed
               if "__" in col
               and col.split("__")[0] in cat_columns]

Let’s also save the list of columns so we can enforce the order of columns later on.

processed_columns = list(df_processed.columns[:])
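
For our toy dataset, cat_dummies now contains the five dummy columns, and processed_columns additionally keeps duration first:

print(cat_dummies)
# ['city__Cambridge', 'city__Liverpool', 'city__London',
#  'transport__bus', 'transport__car']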

Process our unseen (test) data!

Now let’s make sure our test data ends up with the same columns. First, let’s call get_dummies on it:

df_test_processed = pd.get_dummies(df_test, prefix_sep="__",
                                   columns=cat_columns)

Let’s look at our new dataset:

[Image: our df_test_processed DataFrame]

As expected we have new columns (city__Manchester, transport__bike) and missing ones (city__London, transport__bus). But we can easily clean it up!

# Remove additional columns; iterate over a copy of the column
# list since we drop columns as we go
for col in list(df_test_processed.columns):
    if ("__" in col) and (col.split("__")[0] in cat_columns) \
            and col not in cat_dummies:
        print("Removing additional feature {}".format(col))
        df_test_processed.drop(col, axis=1, inplace=True)

Now we need to add the missing columns. We can set all missing columns to a vector of 0s since those values did not appear in the test data.

for col in cat_dummies:
    if col not in df_test_processed.columns:
        print("Adding missing feature {}".format(col))
        df_test_processed[col] = 0
[Image: our df_test_processed DataFrame]
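
As a quick sanity check, both DataFrames should now contain the same set of features:

# Same features on both sides (the order may still differ)
assert set(df_test_processed.columns) == set(processed_columns)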

That’s it, we now have the same features. Note that the order of the columns isn’t kept, though; if you need to reorder the columns, reuse the list of processed columns we saved earlier:

df_test_processed = df_test_processed[processed_columns]
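
For production use, you can package all of these steps into the custom class mentioned at the start, with the saved state as attributes. Here is a minimal sketch of what that could look like (the class name and attribute names are illustrative, not a standard API):

class DummyEncoder:
    """One hot encode with get_dummies, consistently across datasets."""
    def __init__(self, cat_columns):
        self.cat_columns = cat_columns

    def fit_transform(self, df):
        processed = pd.get_dummies(df, prefix_sep="__",
                                   columns=self.cat_columns)
        # Internal state, saved at training time
        self.cat_dummies_ = [col for col in processed
                             if "__" in col
                             and col.split("__")[0] in self.cat_columns]
        self.columns_ = list(processed.columns)
        return processed

    def transform(self, df):
        processed = pd.get_dummies(df, prefix_sep="__",
                                   columns=self.cat_columns)
        # Drop columns that were not seen at training time
        processed = processed[[col for col in processed.columns
                               if col in self.columns_]].copy()
        # Add training columns missing from the new data
        for col in self.cat_dummies_:
            if col not in processed.columns:
                processed[col] = 0
        # Enforce the training column order
        return processed[self.columns_]

You would then call enc = DummyEncoder(cat_columns), use enc.fit_transform(df) at training time and enc.transform(df_test) on new data.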

All good! Now let’s see how to do the same with sklearn and the OneHotEncoder.

Using sklearn’s one hot and label encoder

Process our training data

Let’s start by importing what we need: the OneHotEncoder to build the one hot features, and the LabelEncoder to transform strings into integer labels (needed before using the OneHotEncoder).

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

We’re starting again from our initial dataframe and our list of categorical features.

First let’s create our df_processed DataFrame; we can take all the non-categorical features to start with:

df_processed = df[[col for col in df.columns
                   if col not in cat_columns]].copy()

Now we need to encode every categorical feature separately, meaning we need as many encoders as categorical features. Let’s loop over all categorical features and build a dictionary that will map a feature to its encoder:

# For each categorical column we fit a label encoder,
# transform the column and add it to our new dataframe
label_encoders = {}
for col in cat_columns:
    print("Encoding {}".format(col))
    new_le = LabelEncoder()
    df_processed[col] = new_le.fit_transform(df[col])
    label_encoders[col] = new_le
[Image: our df_processed DataFrame]
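
LabelEncoder assigns integers to values in alphabetical order, so with our data the processed DataFrame should look like this (expected output shown as comments):

print(df_processed)
#    duration  city  transport
# 0        20     2          1
# 1        10     0          1
# 2        30     1          0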

Now that we have proper integer labels, we need to one hot encode our categorical features.

Unfortunately, the one hot encoder does not support passing the list of categorical features by their names but only by their indexes, so let’s get a new list, now with indexes. We can use the get_loc method to get the index of each of our categorical columns:

cat_columns_idx = [df_processed.columns.get_loc(col)
                   for col in cat_columns]

We’ll need to set handle_unknown to "ignore" so the OneHotEncoder can work later on with our unseen data. The OneHotEncoder will build a numpy array for our data, replacing our original features with their one hot encoded versions. Unfortunately it can be hard to rebuild the DataFrame with nice labels, but most algorithms work with numpy arrays, so we can stop there.

ohe = OneHotEncoder(categorical_features=cat_columns_idx,
                    sparse=False, handle_unknown="ignore")
df_processed_np = ohe.fit_transform(df_processed)
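
With this (pre-0.20) API, the encoded columns are placed first in the output array, followed by the remaining numerical columns, so here we expect 3 city columns, 2 transport columns and duration:

print(df_processed_np.shape)
# (3, 6)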

Process our unseen (test) data

Now we need to apply the same steps on our test data; first create a new dataframe with our non-categorical features:

df_test_processed = df_test[[col for col in df_test.columns
                             if col not in cat_columns]].copy()

Now we need to reuse our LabelEncoders to assign the same integer to the same values. Unfortunately, since we have new, unseen values in our test dataset, we cannot use transform. Instead, we will create a new dictionary from the classes_ defined in our label encoders. Those classes map a value to an integer. If we then use map on our pandas Series, new values are set to NaN and the type is converted to float.

Here we will add a new step that fills the NaNs with a huge integer, say 9999, and converts the column to int.

for col in cat_columns:
    print("Encoding {}".format(col))
    label_map = {val: label for label, val in
                 enumerate(label_encoders[col].classes_)}
    print(label_map)
    df_test_processed[col] = df_test[col].map(label_map)
    # fillna and convert to int
    df_test_processed[col] = df_test_processed[col].fillna(9999).astype(int)
[Image: our df_test_processed DataFrame]
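
With our data, Manchester and bike are unseen, so they get mapped to 9999 (expected output shown as comments):

print(df_test_processed)
#    duration  city  transport
# 0        30  9999       9999
# 1        40     0          1
# 2        10     1       9999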

Looks good! Now we can finally apply our fitted OneHotEncoder “out-of-the-box” by using the transform method:

df_test_processed_np = ohe.transform(df_test_processed)

Double check that it has the same columns as the pandas version!
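
One last note: the categorical_features argument and the LabelEncoder detour reflect the scikit-learn API available when this tutorial was written (categorical_features was removed in later releases). Since scikit-learn 0.20, OneHotEncoder accepts string columns directly and column selection is handled by ColumnTransformer. Here is a sketch under that assumption (in scikit-learn 1.2+, the sparse argument was renamed sparse_output):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One hot encode the categorical columns, pass duration through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore", sparse=False),
      cat_columns)],
    remainder="passthrough")
df_processed_np = ct.fit_transform(df)
# Unseen values in new data become all-zero rows for that feature
df_test_processed_np = ct.transform(df_test)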

Note: original notebook is available here

Thanks for reading! If you found this tutorial useful, we’d appreciate your support by clicking the clap (👏🏼) button below or by sharing this article so others can find it.

Keep an eye out for our upcoming tutorials! Busy schedule? Be sure to follow us on Medium and register for our Data Science newsletter by clicking here so you never miss out.
