get_dummies() and reindex()
A Beginner-Friendly Tutorial on Handling Categorical Data During Training and Prediction
Introduction
When building Machine Learning models, one of the most common challenges is handling categorical (text) data such as:
- Gender
- City
- Department
- Product Type
Machine Learning algorithms work with numbers, not text. Therefore, categorical values must be converted into numerical form before training a model.
In Pandas, the most commonly used method for this is:
pd.get_dummies()
However, a new challenge appears during prediction when real users provide data. The columns generated during prediction may not match the columns used during training.
To solve this problem, Pandas provides:
reindex()
This tutorial explains why get_dummies() and reindex() are used together and how they ensure that training data and prediction data have the same structure.
Suppose a company wants to predict whether a customer will purchase a product.
The dataset contains:
| Feature | Type |
|---|---|
| Gender | Categorical |
| City | Categorical |
| Age | Numerical |
Target:
| Buy Product |
|---|
| 0 = No |
| 1 = Yes |
Step 1: Creating the Training Dataset
import pandas as pd
train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
    'Buy': [1, 0, 1, 0]
})
train_df
Training Data
| Gender | City | Age | Buy |
|---|---|---|---|
| Male | Bangalore | 25 | 1 |
| Female | Chennai | 30 | 0 |
| Male | Hyderabad | 35 | 1 |
| Female | Bangalore | 28 | 0 |
Step 2: Converting Categorical Data Using get_dummies()
The model cannot understand text values like:
Male
Female
Bangalore
Chennai
These values must be converted into numbers.
X_train = pd.get_dummies(
    train_df[['Gender', 'City', 'Age']]
)
X_train
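Output (dummy values are shown as 0/1; pandas 2.0 and later display them as False/True unless you pass dtype=int)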
| Age | Gender_Female | Gender_Male | City_Bangalore | City_Chennai | City_Hyderabad |
|---|---|---|---|---|---|
| 25 | 0 | 1 | 1 | 0 | 0 |
| 30 | 1 | 0 | 0 | 1 | 0 |
| 35 | 0 | 1 | 0 | 0 | 1 |
| 28 | 1 | 0 | 1 | 0 | 0 |
Understanding What Happened
Gender Column
Original:
Male
Female
Converted into:
| Gender_Female | Gender_Male |
|---|---|
| 0 | 1 |
| 1 | 0 |
City Column
Original:
Bangalore
Chennai
Hyderabad
Converted into:
| City_Bangalore | City_Chennai | City_Hyderabad |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
This process is called One-Hot Encoding.
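To see the encoding in isolation, here is a minimal sketch on a single column. The `cities` Series is made up for illustration and is not part of the tutorial's dataset. `dtype=int` is passed so the result shows 0/1 rather than the False/True that pandas 2.0 and later produce by default.

```python
import pandas as pd

# Hypothetical single-column example, separate from the tutorial's dataset
cities = pd.Series(['Bangalore', 'Chennai', 'Hyderabad'], name='City')

# prefix= names the new columns City_<category>; dtype=int gives 0/1 values
encoded = pd.get_dummies(cities, prefix='City', dtype=int)

# Each row contains exactly one 1, which is where the name "one-hot" comes from
row_sums = encoded.sum(axis=1).tolist()

print(encoded)
print(row_sums)
```

Each category becomes its own column, and exactly one of those columns is set for any given row.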
Step 3: Training the Model
from sklearn.linear_model import LogisticRegression
X = X_train
y = train_df['Buy']
model = LogisticRegression()
model.fit(X, y)
The model is now trained.
Important Observation
The model has learned using these columns:
X_train.columns
Output:
Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad
These columns become the model's expected input structure.
Saving the Training Structure
The training columns should be stored for future predictions.
train_columns = X_train.columns
This step matters because the prediction code will later need these exact column names, in this exact order.
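In a real application, training and prediction usually run in separate processes, so the column list has to be written somewhere. One possible approach, sketched below, is to store it as JSON. The file name `train_columns.json` and the temporary directory are assumptions for illustration; the tutorial itself only keeps the columns in a variable.

```python
import json
import os
import tempfile

import pandas as pd

# Small stand-in for the training features (illustrative values)
X_train = pd.get_dummies(pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'City': ['Bangalore', 'Chennai'],
    'Age': [25, 30],
}))

# Assumed location: a temp directory and a file name chosen for this sketch
path = os.path.join(tempfile.mkdtemp(), 'train_columns.json')

# At training time: write the column names as a plain list
with open(path, 'w') as f:
    json.dump(X_train.columns.tolist(), f)

# At prediction time: read them back before calling reindex()
with open(path) as f:
    train_columns = json.load(f)

print(train_columns)
```

A plain list of strings is enough here, because `reindex(columns=...)` accepts any list-like of labels.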
Step 4: Real-Time User Prediction
Imagine a user opens a web application and enters:
Gender = Male
City = Chennai
Age = 40
The application receives:
test_df = pd.DataFrame({
    'Gender': ['Male'],
    'City': ['Chennai'],
    'Age': [40]
})
test_df
Step 5: Applying get_dummies() on User Data
X_test = pd.get_dummies(test_df)
X_test
Output:
| Age | Gender_Male | City_Chennai |
|---|---|---|
| 40 | 1 | 1 |
The Problem
Training data had:
Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad
User data has:
Age
Gender_Male
City_Chennai
Missing columns:
Gender_Female
City_Bangalore
City_Hyderabad
The model expects 6 columns but receives only 3.
Prediction will fail: scikit-learn raises a ValueError because the input features do not match those seen during training.
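The failure can be reproduced with a short, self-contained sketch. It rebuilds the tutorial's training data, encodes the single user row without any alignment, and catches the error. The `failed` flag is a hypothetical name added only to make the outcome easy to inspect, and the exact error message depends on the scikit-learn version.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
    'Buy': [1, 0, 1, 0],
})
X_train = pd.get_dummies(train_df[['Gender', 'City', 'Age']])  # 6 columns
model = LogisticRegression().fit(X_train, train_df['Buy'])

# One user row, encoded on its own: only 3 columns come out
X_test = pd.get_dummies(pd.DataFrame({
    'Gender': ['Male'],
    'City': ['Chennai'],
    'Age': [40],
}))

try:
    model.predict(X_test)
    failed = False
except ValueError as err:
    # The message wording varies across scikit-learn versions
    failed = True
    print('Prediction failed:', err)
```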
Why Are Columns Missing?
The user selected:
Male
Chennai
Therefore:
Female does not appear in the input
Bangalore does not appear in the input
Hyderabad does not appear in the input
Since those categories are absent, get_dummies() does not create those columns.
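Rather than comparing the two column lists by eye, the missing columns can be computed. The sketch below is one simple way to do it; the name `missing` is introduced here for illustration.

```python
import pandas as pd

train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
})
X_train = pd.get_dummies(train_df)

# get_dummies() only sees 'Male' and 'Chennai' in this single row
X_test = pd.get_dummies(pd.DataFrame({
    'Gender': ['Male'],
    'City': ['Chennai'],
    'Age': [40],
}))

# Training columns that the user row did not produce, in training order
missing = [col for col in X_train.columns if col not in X_test.columns]
print(missing)
```

A check like this is also useful for logging when debugging a prediction service.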
Step 6: Using reindex() to Align Columns
To make the prediction data identical to the training data:
X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)
Output After Reindex
| Age | Gender_Female | Gender_Male | City_Bangalore | City_Chennai | City_Hyderabad |
|---|---|---|---|---|---|
| 40 | 0 | 1 | 0 | 1 | 0 |
What Did reindex() Do?
Added Missing Columns
Added:
Gender_Female
City_Bangalore
City_Hyderabad
Filled Missing Values
Because the user selected:
Male
Chennai
the missing columns become:
Gender_Female = 0
City_Bangalore = 0
City_Hyderabad = 0
Arranged Columns in the Correct Order
Before:
Age
Gender_Male
City_Chennai
After:
Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad
Now the structure exactly matches the training dataset.
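The three effects described above can be checked directly. In this sketch the user row is deliberately built with its columns in a different order, to make the reordering visible. The hand-written `train_columns` list mirrors the training columns from this tutorial.

```python
import pandas as pd

train_columns = ['Age', 'Gender_Female', 'Gender_Male',
                 'City_Bangalore', 'City_Chennai', 'City_Hyderabad']

# Encoded user row, with columns intentionally out of order
X_test = pd.DataFrame({'City_Chennai': [1], 'Gender_Male': [1], 'Age': [40]})

# Adds the missing columns, fills them with 0, and applies the training order
aligned = X_test.reindex(columns=train_columns, fill_value=0)

same_layout = aligned.columns.tolist() == train_columns
row = aligned.iloc[0].tolist()

print(same_layout)
print(row)
```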
Visual Representation
Training Structure
Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad
↓
User Input After get_dummies()
Age
Gender_Male
City_Chennai
↓
After reindex()
Age
Gender_Female
Gender_Male
City_Bangalore
City_Chennai
City_Hyderabad
✓ Same structure as training data
✓ Safe for prediction
Making the Prediction
prediction = model.predict(X_test)
print(prediction)
Output:
[1]
The model can now make predictions successfully because the feature structure matches the training data.
Real-World Workflow
Training Phase
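X_train = pd.get_dummies(
    train_df[['Gender', 'City', 'Age']]
)
y_train = train_df['Buy']  # target column, as in Step 3
train_columns = X_train.columns  # save the column layout for prediction time
model.fit(X_train, y_train)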
Prediction Phase
user_input = pd.DataFrame({
    'Gender': ['Male'],
    'City': ['Chennai'],
    'Age': [40]
})
X_test = pd.get_dummies(user_input)
X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)
prediction = model.predict(X_test)
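The two phases can also be packaged so that the prediction path always encodes and aligns in one call. The following is a sketch under that assumption: `prepare_features` is a hypothetical helper name, not a pandas or scikit-learn function, and the data is the tutorial's own example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def prepare_features(raw_df, train_columns):
    """Hypothetical helper: one-hot encode raw input, then align it to the training columns."""
    encoded = pd.get_dummies(raw_df)
    return encoded.reindex(columns=train_columns, fill_value=0)


# Training phase
train_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 35, 28],
    'Buy': [1, 0, 1, 0],
})
X_train = pd.get_dummies(train_df[['Gender', 'City', 'Age']])
train_columns = X_train.columns
model = LogisticRegression().fit(X_train, train_df['Buy'])

# Prediction phase
user_input = pd.DataFrame({'Gender': ['Male'], 'City': ['Chennai'], 'Age': [40]})
X_user = prepare_features(user_input, train_columns)
prediction = model.predict(X_user)
print(prediction)
```

Keeping the encode-and-align step in a single function makes it harder to forget the `reindex()` call when new prediction entry points are added.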
Key Learning Points
get_dummies()
- Converts categorical values into numerical columns.
- Creates one column for each category.
- Used during both training and prediction.
reindex()
- Ensures prediction data has the same columns as training data.
- Adds missing columns automatically.
- Fills missing columns with 0.
- Maintains the same column order.
- Prevents prediction errors.
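One related case is worth checking: a user may enter a category that never appeared in training. Because `reindex(columns=...)` keeps only the labels it is given, the extra dummy column is dropped. The sketch below uses a made-up city, 'Mumbai', to illustrate this, and passes `dtype=int` so the values print as 0/1.

```python
import pandas as pd

train_columns = ['Age', 'Gender_Female', 'Gender_Male',
                 'City_Bangalore', 'City_Chennai', 'City_Hyderabad']

# 'Mumbai' is a hypothetical city that was not in the training data
X_new = pd.get_dummies(
    pd.DataFrame({'Gender': ['Female'], 'City': ['Mumbai'], 'Age': [33]}),
    dtype=int,
)

# City_Mumbai is not in train_columns, so reindex() drops it
aligned = X_new.reindex(columns=train_columns, fill_value=0)

row = aligned.iloc[0].tolist()
print(X_new.columns.tolist())
print(row)
```

The prediction no longer fails, but every City column is 0 for this row, so the model effectively sees a customer with no city. Whether that is acceptable depends on the application.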
Summary
Machine Learning models remember the feature structure used during training. When a user provides new data, the generated dummy columns may not match the training columns. Using:
X_test = X_test.reindex(
    columns=train_columns,
    fill_value=0
)
guarantees that the prediction data has the same columns, in the same order, as the training data.
Think of get_dummies() as the process that creates the columns, and reindex() as the process that arranges and completes those columns so the model receives data in the format it expects.
This simple technique is used in countless real-world Machine Learning applications, including web apps, APIs, dashboards, and production prediction systems.