Multinomial Naive Bayes
Multinomial Naive Bayes is a type of Naive Bayes algorithm mainly used when features represent counts or frequencies.
It is commonly used in text classification problems such as:
- Spam Detection
- Sentiment Analysis
- News Classification
- Document Categorization
In text data, each feature usually represents how many times a word appears in a document.
That is why Multinomial Naive Bayes is very useful for text classification.
Why Do We Need Multinomial Naive Bayes?
In Gaussian Naive Bayes, we used continuous numerical values such as:
- Height
- Weight
- Age
- Salary
But in text classification, the data is different.
Example email:
"free free offer winner"
Here, we count word frequency:
| Word | Count |
|---|---|
| free | 2 |
| offer | 1 |
| winner | 1 |
So the features are not continuous values. They are word counts.
For this type of data, we use:
Multinomial Naive Bayes
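The counting step for the example email can be sketched in a few lines. This assumes simple whitespace tokenization, which is a simplification; real pipelines usually also lowercase text and strip punctuation.

```python
from collections import Counter

email = "free free offer winner"

# Split on whitespace and count how often each word occurs.
word_counts = Counter(email.split())

for word, count in word_counts.items():
    print(word, count)
```

The resulting counts are the feature values that Multinomial Naive Bayes works with.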
Core Idea
Multinomial Naive Bayes predicts the class by checking:
Which class is more likely to generate these word counts?
For example, if an email contains:
free offer winner
Multinomial Naive Bayes checks:
P(Spam | free, offer, winner)
and
P(Not Spam | free, offer, winner)
Then it chooses the class with the higher probability.
Bayes Theorem in Multinomial Naive Bayes
The basic formula is:
P(A|B) = P(B|A) P(A) / P(B)
For classification, we use:
P(Class | Words) ∝ P(Class) × P(word1 | Class) × P(word2 | Class) × ... × P(wordn | Class)
The denominator is ignored during comparison because it is the same for all classes.
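As a rough illustration of the proportional formula, the sketch below multiplies a prior by one likelihood factor per word occurrence, so a word that appears twice contributes its probability twice. The function name `class_score` and the probability values are hypothetical and chosen only to show the arithmetic.

```python
def class_score(prior, word_probs, words):
    """Unnormalized score: prior times P(word | class) for every word occurrence."""
    score = prior
    for word in words:
        score *= word_probs[word]
    return score

# Hypothetical probabilities, used only to illustrate the formula.
spam_word_probs = {"free": 0.4, "offer": 0.2}

# "free" occurs twice, so its probability is multiplied in twice.
score = class_score(0.5, spam_word_probs, ["free", "free", "offer"])
print(round(score, 6))
```

Only the relative size of the scores matters, which is why the shared denominator P(Words) can be dropped.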
Example Dataset
Suppose we have the following training emails:
| Email | Text | Class |
|---|---|---|
| E1 | free offer | Spam |
| E2 | free winner | Spam |
| E3 | meeting schedule | Not Spam |
| E4 | project meeting | Not Spam |
We want to classify a new email:
free meeting
Step 1: Create Vocabulary
Vocabulary means the list of unique words in the training data.
Training texts:
free offer
free winner
meeting schedule
project meeting
Unique words:
free, offer, winner, meeting, schedule, project
So vocabulary size:
V = 6
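The vocabulary step can be sketched as follows, again assuming whitespace tokenization. The variable names are illustrative.

```python
training_texts = [
    "free offer",
    "free winner",
    "meeting schedule",
    "project meeting",
]

# Collect unique words in first-seen order (dict keys preserve insertion order).
vocabulary = list(dict.fromkeys(
    word for text in training_texts for word in text.split()
))

print(vocabulary)
print("V =", len(vocabulary))
```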
Step 2: Count Words in Each Class
Spam Emails
Spam emails:
free offer
free winner
Word counts:
| Word | Count in Spam |
|---|---|
| free | 2 |
| offer | 1 |
| winner | 1 |
| meeting | 0 |
| schedule | 0 |
| project | 0 |
Total words in Spam:
2 + 1 + 1 = 4
Not Spam Emails
Not Spam emails:
meeting schedule
project meeting
Word counts:
| Word | Count in Not Spam |
|---|---|
| free | 0 |
| offer | 0 |
| winner | 0 |
| meeting | 2 |
| schedule | 1 |
| project | 1 |
Total words in Not Spam:
2 + 1 + 1 = 4
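The per-class counts above can be reproduced with a small sketch. It assumes the training set is stored as (text, label) pairs, which is a choice made here for illustration.

```python
from collections import Counter, defaultdict

training_data = [
    ("free offer", "Spam"),
    ("free winner", "Spam"),
    ("meeting schedule", "Not Spam"),
    ("project meeting", "Not Spam"),
]

# One Counter per class, accumulating word counts across that class's emails.
class_word_counts = defaultdict(Counter)
for text, label in training_data:
    class_word_counts[label].update(text.split())

for label, counts in class_word_counts.items():
    print(label, dict(counts), "total words:", sum(counts.values()))
```

A `Counter` returns 0 for words it has never seen, which matches the zero rows in the tables.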
Step 3: Calculate Prior Probabilities
Total emails:
4
Spam emails:
2
Not Spam emails:
2
Therefore:
P(Spam) = 2/4 = 0.5
P(Not Spam) = 2/4 = 0.5
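A minimal sketch of the prior calculation, assuming the labels are available as a plain list:

```python
from collections import Counter

labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# Prior of a class = number of emails in that class / total number of emails.
label_counts = Counter(labels)
priors = {label: n / len(labels) for label, n in label_counts.items()}

print(priors)
```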
Step 4: Problem of Zero Probability
New email:
free meeting
To calculate Spam probability:
P(Spam | free meeting) ∝ P(Spam) × P(free|Spam) × P(meeting|Spam)
But:
P(meeting|Spam) = 0
because the word meeting never appeared in Spam emails.
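If any single probability is zero, the entire product becomes zero, no matter how strongly the other words point to that class.
This is a problem, because one unseen word would rule out a class completely.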
To solve this, we use:
Laplace Smoothing
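The zero-probability problem can be reproduced with unsmoothed estimates. The helper name `unsmoothed` is hypothetical and only used for this sketch.

```python
spam_counts = {"free": 2, "offer": 1, "winner": 1}
total_spam_words = sum(spam_counts.values())  # 4

def unsmoothed(word):
    """Raw relative frequency of a word in Spam, with no smoothing."""
    return spam_counts.get(word, 0) / total_spam_words

# "meeting" never appears in Spam, so its factor is 0 and wipes out the product.
score = 0.5 * unsmoothed("free") * unsmoothed("meeting")
print(score)
```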
Step 5: Laplace Smoothing
Laplace smoothing adds 1 to every word count.
Formula:
P(word | class) = (count(word in class) + 1) / (total words in class + vocabulary size)
Where:
Vocabulary size = 6
Total words in each class = 4
So denominator:
4 + 6 = 10
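The smoothing formula translates directly into a small function. The `alpha` parameter is an addition in this sketch: `alpha=1` gives the Laplace smoothing described here, and other values give the more general additive smoothing.

```python
def smoothed_probability(count, total_words, vocab_size, alpha=1):
    """P(word | class) with additive smoothing; alpha=1 is Laplace smoothing."""
    return (count + alpha) / (total_words + alpha * vocab_size)

# Each class in the example has 4 words in total and the vocabulary size is 6.
for count in (0, 1, 2):
    print(count, smoothed_probability(count, total_words=4, vocab_size=6))
```

Even a word with a count of 0 now gets a small non-zero probability.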
Step 6: Calculate Word Probabilities for Spam
P(free | Spam)
P(free|Spam) = (count(free in Spam) + 1) / (total Spam words + V)
= (2 + 1) / (4 + 6)
= 3 / 10
= 0.3
P(meeting | Spam)
P(meeting|Spam) = (0 + 1) / (4 + 6)
= 1 / 10
= 0.1
Step 7: Calculate Word Probabilities for Not Spam
P(free | Not Spam)
P(free|NotSpam) = (0 + 1) / (4 + 6)
= 1 / 10
= 0.1
P(meeting | Not Spam)
P(meeting|NotSpam) = (2 + 1) / (4 + 6)
= 3 / 10
= 0.3
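As a check on Steps 6 and 7, the sketch below computes the smoothed likelihood of every vocabulary word for both classes. It uses `fractions.Fraction` to keep the values exact, which also makes it easy to confirm that each class's probabilities sum to 1.

```python
from fractions import Fraction

vocabulary = ["free", "offer", "winner", "meeting", "schedule", "project"]
counts = {
    "Spam": {"free": 2, "offer": 1, "winner": 1},
    "Not Spam": {"meeting": 2, "schedule": 1, "project": 1},
}

likelihoods = {}
for label, word_counts in counts.items():
    total = sum(word_counts.values())
    # Laplace smoothing: add 1 to each count, add V to the denominator.
    likelihoods[label] = {
        word: Fraction(word_counts.get(word, 0) + 1, total + len(vocabulary))
        for word in vocabulary
    }

for label, probs in likelihoods.items():
    print(label, {word: str(p) for word, p in probs.items()})
```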
Step 8: Calculate Final Class Probabilities
New email:
free meeting
Spam Score
P(Spam | free meeting) ∝ P(Spam) × P(free|Spam) × P(meeting|Spam)
= 0.5 × 0.3 × 0.1
= 0.015
Not Spam Score
P(NotSpam | free meeting) ∝ P(NotSpam) × P(free|NotSpam) × P(meeting|NotSpam)
= 0.5 × 0.1 × 0.3
= 0.015
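The two scores can be reproduced with a short loop. The dictionaries below simply hold the prior and smoothed likelihood values from the earlier steps.

```python
prior = {"Spam": 0.5, "Not Spam": 0.5}
likelihood = {
    "Spam": {"free": 0.3, "meeting": 0.1},
    "Not Spam": {"free": 0.1, "meeting": 0.3},
}

scores = {}
for label in prior:
    # Start from the prior and multiply in one factor per word.
    score = prior[label]
    for word in "free meeting".split():
        score *= likelihood[label][word]
    scores[label] = score

for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```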
Step 9: Prediction
| Class | Score |
|---|---|
| Spam | 0.015 |
| Not Spam | 0.015 |
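Both scores are equal, so the classifier cannot separate the two classes for this email.
This happens because "free" appears only in Spam emails, while "meeting" appears only in Not Spam emails, and the two words pull the score equally in opposite directions.
To see a case without a tie, let us test another email.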
Example 2: Classify "free winner"
New email:
free winner
Spam Score
P(Spam | free winner) ∝ P(Spam) × P(free|Spam) × P(winner|Spam)
P(free|Spam) = 0.3
P(winner|Spam) = (1 + 1) / 10 = 0.2
Spam Score = 0.5 × 0.3 × 0.2
= 0.03
Not Spam Score
P(NotSpam | free winner) ∝ P(NotSpam) × P(free|NotSpam) × P(winner|NotSpam)
P(free|NotSpam) = 0.1
P(winner|NotSpam) = (0 + 1) / 10 = 0.1
Not Spam Score = 0.5 × 0.1 × 0.1
= 0.005
Final Prediction
| Class | Score |
|---|---|
| Spam | 0.03 |
| Not Spam | 0.005 |
Since:
0.03 > 0.005
Prediction:
Spam
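The scores are unnormalized, so they do not sum to 1. As a small sketch, dividing each score by the total turns them into posterior probabilities, which restores the denominator that was dropped earlier.

```python
scores = {
    "Spam": 0.5 * 0.3 * 0.2,      # P(Spam) x P(free|Spam) x P(winner|Spam)
    "Not Spam": 0.5 * 0.1 * 0.1,  # P(NotSpam) x P(free|NotSpam) x P(winner|NotSpam)
}

# Normalize so the values sum to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

prediction = max(scores, key=scores.get)
print(prediction)
for label, p in posteriors.items():
    print(f"{label}: {p:.3f}")
```

Normalizing does not change which class wins; it only makes the scores easier to interpret.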
How Multinomial Naive Bayes Works
Training Text Data
↓
Create Vocabulary
↓
Count Word Frequencies per Class
↓
Calculate Prior Probabilities
↓
Apply Laplace Smoothing
↓
Calculate Word Likelihoods
↓
Multiply Probabilities
↓
Choose Class with Highest Score
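The whole pipeline can be sketched from scratch in plain Python. This is an illustrative version, not a reference implementation: the function names `train` and `predict` are made up here, tokenization is whitespace-only, words outside the training vocabulary are skipped, and log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict

def train(texts, labels, alpha=1):
    """Return log priors and smoothed log likelihoods for each class."""
    vocab = sorted({word for text in texts for word in text.split()})
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        total = sum(word_counts[c].values())
        denominator = total + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denominator) for w in vocab
        }
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the best class and the log score of every class."""
    scores = {}
    for c in log_prior:
        # Unknown words are ignored (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get), scores

texts = ["free offer", "free winner", "meeting schedule", "project meeting"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

log_prior, log_likelihood = train(texts, labels)
label, scores = predict("free winner", log_prior, log_likelihood)
print(label, {c: round(math.exp(s), 3) for c, s in scores.items()})
```

Taking `exp` of a log score recovers the product used in the worked examples.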
Python Implementation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
emails = [
    "free offer",
    "free winner",
    "meeting schedule",
    "project meeting",
]
labels = [
    "Spam",
    "Spam",
    "Not Spam",
    "Not Spam",
]

# Convert text into word count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Create Multinomial Naive Bayes model
model = MultinomialNB()

# Train model
model.fit(X, labels)

# Test emails
test_emails = [
    "free meeting",
    "free winner",
]
X_test = vectorizer.transform(test_emails)
predictions = model.predict(X_test)

for email, prediction in zip(test_emails, predictions):
    print(email, "->", prediction)
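Output:
free meeting -> Not Spam
free winner -> Spam
The first email is an exact tie. scikit-learn selects the class with the highest score by position in model.classes_, which is sorted alphabetically, so a tie should resolve to the first class, Not Spam.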
Advantages
- Very effective for text classification
- Works well with word count features
- Fast training and prediction
- Handles high-dimensional sparse data
- Simple and easy to implement
Limitations
- Works only with non-negative feature counts
- Assumes word independence
- Sensitive to rare words
- Not suitable for continuous numerical features
Applications
- Spam Detection
- Sentiment Analysis
- News Classification
- Document Categorization
- Language Detection
- Topic Classification
Important Points
- Multinomial Naive Bayes is used for count-based features.
- It is widely used in text classification.
- Features usually represent word frequencies.
- It uses Bayes Theorem for classification.
- Laplace smoothing avoids zero probability.
- The class with the highest probability score is selected.
- It works well with CountVectorizer counts and, in practice, often with TF-IDF features as well.
- It is not suitable for continuous numerical data such as height or salary, where Gaussian Naive Bayes is the usual choice.
Keywords
Multinomial Naive Bayes, Text Classification, Word Count Features, CountVectorizer, Spam Detection, Laplace Smoothing, Naive Bayes Classifier, Document Classification, Machine Learning NLP, Probabilistic Classifier