get_dummies() and reindex() - Machine Learning

A Beginner-Friendly Tutorial on Handling Categorical Data During Training and Prediction

Introduction

When building Machine Learning models, one of the most common challenges is handling categorical (text) data such as:

Gender
City
Department
Product Type

Most Machine Learning algorithms work with numbers, not text. Categorical values must therefore be converted into numerical form before training a model.

In Pandas, the most commonly used method for this is:

pd.get_dummies()

However, a new challenge appears during prediction, when real users provide data: the columns generated at prediction time may not match the columns used during training.

To solve this problem, Pandas provides:

reindex()

This tutorial explains why get_dummies() and reindex() are used together and how they ensure that training data and prediction data have the same structure.

Suppose a company wants to predict whether a customer will purchase a product.

The dataset contains:

Feature	Type
Gender	Categorical
City	Categorical
Age	Numerical

Target:

Buy Product
0 = No
1 = Yes

Step 1: Creating the Training Dataset

import pandas as pd

train_df = pd.DataFrame({
    'Gender':['Male','Female','Male','Female'],
    'City':['Bangalore','Chennai','Hyderabad','Bangalore'],
    'Age':[25,30,35,28],
    'Buy':[1,0,1,0]
})

train_df

Training Data

Gender	City	Age	Buy
Male	Bangalore	25	1
Female	Chennai	30	0
Male	Hyderabad	35	1
Female	Bangalore	28	0

Step 2: Converting Categorical Data Using get_dummies()

The model cannot understand text values like:

Male
Female
Bangalore
Chennai

These values must be converted into numbers.

X_train = pd.get_dummies(
    train_df[['Gender','City','Age']]
)

X_train

Output

Age	Gender_Female	Gender_Male	City_Bangalore	City_Chennai	City_Hyderabad
25	0	1	1	0	0
30	1	0	0	1	0
35	0	1	0	0	1
28	1	0	1	0	0
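Depending on the installed pandas version, the dummy columns may display as True/False instead of 0/1, because pandas 2.0 changed the default dtype of get_dummies() to bool. The following is a minimal sketch, assuming integer output like the table above is wanted; it passes dtype=int explicitly.

```python
import pandas as pd

train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
})

# dtype=int forces 0/1 integers instead of the version-dependent default
X_train = pd.get_dummies(train_df[['Gender', 'City', 'Age']], dtype=int)

print(X_train.columns.tolist())
print(X_train.iloc[0].tolist())  # first row: Male, Bangalore, 25
```

The numeric Age column is left unchanged and placed first; the dummy columns follow in the order of the original categorical columns.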

Understanding What Happened

Gender Column

Original:

Male
Female

Converted into:

Gender_Female	Gender_Male
0	1
1	0

City Column

Original:

Bangalore
Chennai
Hyderabad

Converted into:

City_Bangalore	City_Chennai	City_Hyderabad
1	0	0
0	1	0
0	0	1

This process is called One-Hot Encoding.
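To make the idea concrete, one-hot encoding can also be written by hand. The sketch below is only an illustration (the names manual and builtin are made up for this example); it builds one 0/1 column per city and compares the result with get_dummies().

```python
import pandas as pd

city = pd.Series(['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'], name='City')

# Build one 0/1 column per distinct category, in sorted order
manual = pd.DataFrame({
    f'City_{value}': (city == value).astype(int)
    for value in sorted(city.unique())
})

# The built-in encoder, asked for integer output
builtin = pd.get_dummies(city, prefix='City', dtype=int)

same_columns = manual.columns.tolist() == builtin.columns.tolist()
same_values = bool((manual.to_numpy() == builtin.to_numpy()).all())
print(same_columns, same_values)
```

Each row contains exactly one 1 among the City columns, which is where the name "one-hot" comes from.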

Step 3: Training the Model

from sklearn.linear_model import LogisticRegression

X = X_train
y = train_df['Buy']

model = LogisticRegression()

model.fit(X, y)

The model is now trained.

Important Observation

The model has learned using these columns:

X_train.columns

Output:

Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad

These columns become the model's expected input structure.

Saving the Training Structure

The training columns should be stored for future predictions.

train_columns = X_train.columns

This step matters: without the saved column list, the training layout cannot be rebuilt at prediction time.
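Training and prediction often run in separate programs, so the column list usually has to be written to disk. The following is a minimal sketch, assuming a JSON file is acceptable; the file name train_columns.json and its location are hypothetical choices for this example.

```python
import json
import os
import tempfile

import pandas as pd

train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
})
X_train = pd.get_dummies(train_df[['Gender', 'City', 'Age']], dtype=int)

# Hypothetical location; a real application would choose a stable path
columns_path = os.path.join(tempfile.gettempdir(), 'train_columns.json')

# Training side: save the column names as a plain list, in training order
with open(columns_path, 'w') as f:
    json.dump(X_train.columns.tolist(), f)

# Prediction side: load the same list back
with open(columns_path) as f:
    train_columns = json.load(f)

print(train_columns)
```

A plain list of strings works with reindex() just as well as the original Index object.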

Step 4: Real-Time User Prediction

Imagine a user opens a web application and enters:

Gender = Male
City = Chennai
Age = 40

The application receives:

test_df = pd.DataFrame({
    'Gender':['Male'],
    'City':['Chennai'],
    'Age':[40]
})

test_df

Step 5: Applying get_dummies() on User Data

X_test = pd.get_dummies(test_df)

X_test

Output:

Age	Gender_Male	City_Chennai
40	1	1

The Problem

Training data had:

Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad

User data has:

Age
Gender_Male
City_Chennai

Missing columns:

Gender_Female
City_Bangalore
City_Hyderabad

The model expects 6 columns but receives only 3.

Prediction will fail: scikit-learn raises an error when the input features do not match the features seen during training.
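The failure can be reproduced directly. The sketch below rebuilds the tutorial's model and calls predict() on the unaligned input; the exact error message depends on the scikit-learn version, so only the error type is caught here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
    'Buy': [1, 0, 1, 0],
})
X_train = pd.get_dummies(train_df[['Gender', 'City', 'Age']], dtype=int)
model = LogisticRegression().fit(X_train, train_df['Buy'])

test_df = pd.DataFrame({'Gender': ['Male'], 'City': ['Chennai'], 'Age': [40]})
X_test = pd.get_dummies(test_df, dtype=int)  # only 3 columns

try:
    model.predict(X_test)
    failed = False
except ValueError as err:
    # The input has fewer features than the model was trained on
    failed = True
    print('Prediction failed:', err)
```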

Why Are Columns Missing?

The user selected:

Male
Chennai

Therefore:

Female does not appear in the input
Bangalore does not appear in the input
Hyderabad does not appear in the input

Since those categories are absent, get_dummies() does not create those columns.
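The gap can be listed programmatically. This is a small sketch, assuming the training column names are available as a list; the variable name missing is chosen only for this example.

```python
import pandas as pd

train_columns = [
    'Age', 'Gender_Female', 'Gender_Male',
    'City_Bangalore', 'City_Chennai', 'City_Hyderabad',
]

test_df = pd.DataFrame({'Gender': ['Male'], 'City': ['Chennai'], 'Age': [40]})
X_test = pd.get_dummies(test_df, dtype=int)

# Training columns that the single-row input did not produce
missing = [col for col in train_columns if col not in X_test.columns]
print(missing)
```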

Step 6: Using reindex() to Align Columns

To make the prediction data identical to the training data:

X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)

Output After Reindex

Age	Gender_Female	Gender_Male	City_Bangalore	City_Chennai	City_Hyderabad
40	0	1	0	1	0

What Did reindex() Do?

Added Missing Columns

Added:

Gender_Female
City_Bangalore
City_Hyderabad

Filled Missing Values

Because the user selected:

Male
Chennai

the missing columns become:

Gender_Female = 0
City_Bangalore = 0
City_Hyderabad = 0
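The fill_value=0 argument is what produces these zeros. As a quick sketch of the difference, the code below runs reindex() with and without it; by default, newly added columns are filled with NaN, which most models cannot accept.

```python
import pandas as pd

train_columns = [
    'Age', 'Gender_Female', 'Gender_Male',
    'City_Bangalore', 'City_Chennai', 'City_Hyderabad',
]
X_test = pd.DataFrame({'Age': [40], 'Gender_Male': [1], 'City_Chennai': [1]})

without_fill = X_test.reindex(columns=train_columns)             # new columns are NaN
with_fill = X_test.reindex(columns=train_columns, fill_value=0)  # new columns are 0

nan_cells = int(without_fill.isna().sum().sum())
print(nan_cells)
print(with_fill.iloc[0].tolist())
```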

Arranged Columns in the Correct Order

Before:

Age
Gender_Male
City_Chennai

After:

Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad

Now the structure exactly matches the training dataset.
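Column order matters as well as column presence. The sketch below starts from a deliberately shuffled input (a made-up case for illustration) and shows that reindex() restores the training order.

```python
import pandas as pd

train_columns = [
    'Age', 'Gender_Female', 'Gender_Male',
    'City_Bangalore', 'City_Chennai', 'City_Hyderabad',
]

# Same user input as before, but with the columns in a different order
shuffled = pd.DataFrame({'City_Chennai': [1], 'Gender_Male': [1], 'Age': [40]})

aligned = shuffled.reindex(columns=train_columns, fill_value=0)

order_matches = aligned.columns.tolist() == train_columns
print(order_matches)
print(aligned.iloc[0].tolist())
```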

Visual Representation

Training Structure

Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad

↓

User Input After get_dummies()

Age
Gender_Male
City_Chennai

↓

After reindex()

Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad

✓ Same structure as training data

✓ Safe for prediction

Making the Prediction

prediction = model.predict(X_test)

print(prediction)

Output:

[1]

The model can now make predictions successfully because the feature structure matches the training data.

Real-World Workflow

Training Phase

X_train = pd.get_dummies(
    train_df[['Gender','City','Age']]
)

train_columns = X_train.columns

y_train = train_df['Buy']

model.fit(X_train, y_train)

Prediction Phase

user_input = pd.DataFrame({
    'Gender':['Male'],
    'City':['Chennai'],
    'Age':[40]
})

X_test = pd.get_dummies(user_input)

X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)

prediction = model.predict(X_test)
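In an application, the two prediction-phase steps are convenient to wrap in one function so they are always applied together. The following is one possible sketch; prepare_features is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import pandas as pd


def prepare_features(user_input, train_columns):
    """Encode raw user input and align it to the training columns."""
    encoded = pd.get_dummies(user_input, dtype=int)
    return encoded.reindex(columns=train_columns, fill_value=0)


train_columns = [
    'Age', 'Gender_Female', 'Gender_Male',
    'City_Bangalore', 'City_Chennai', 'City_Hyderabad',
]

user_input = pd.DataFrame({
    'Gender': ['Female'],
    'City': ['Hyderabad'],
    'Age': [31],
})

X_new = prepare_features(user_input, train_columns)
print(X_new.iloc[0].tolist())
```

The result can be passed straight to model.predict(), since it always has the training columns in the training order.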

Key Learning Points

get_dummies()

Converts categorical values into numerical columns.
Creates one column for each category.
Used during both training and prediction.

reindex()

Ensures prediction data has the same columns as training data.
Adds any missing columns.
Fills the added columns with 0 when fill_value=0 is passed (the default is NaN).
Maintains the same column order as the training data.
Prevents column-mismatch errors at prediction time.
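One caveat is worth knowing: reindex() also drops any column that is not in the training list. If a user enters a category the model never saw, the input is silently encoded as all zeros for that feature. The sketch below uses Mumbai as a made-up unseen city to show this.

```python
import pandas as pd

train_columns = [
    'Age', 'Gender_Female', 'Gender_Male',
    'City_Bangalore', 'City_Chennai', 'City_Hyderabad',
]

# 'Mumbai' was not present in the training data
user_input = pd.DataFrame({'Gender': ['Male'], 'City': ['Mumbai'], 'Age': [40]})
encoded = pd.get_dummies(user_input, dtype=int)
aligned = encoded.reindex(columns=train_columns, fill_value=0)

# Columns produced by the input that the model has never seen
unseen = [col for col in encoded.columns if col not in train_columns]
city_values = aligned.filter(like='City_').iloc[0].tolist()

print(unseen)
print(city_values)
```

Prediction still runs in this case, so an application may want to check for unseen columns before reindexing and decide how to report them.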

Summary

A trained Machine Learning model expects the same feature structure it saw during training. When a user provides new data, the generated dummy columns may not match the training columns. Using:

X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)

ensures that the prediction data has the same columns, in the same order, as the training data.

Think of get_dummies() as the process that creates the columns, and reindex() as the process that arranges and completes those columns so the model receives data in the format it expects.

This simple pattern is common in real-world Machine Learning applications, including web apps, APIs, dashboards, and production prediction systems.