Multinomial Naive Bayes - Machine Learning

Multinomial Naive Bayes is a type of Naive Bayes algorithm mainly used when features represent counts or frequencies.

It is commonly used in text classification problems such as:

Spam Detection
Sentiment Analysis
News Classification
Document Categorization

In text data, features usually represent:

How many times a word appears in a document

That is why Multinomial Naive Bayes is very useful for text classification.

Why Do We Need Multinomial Naive Bayes?

In Gaussian Naive Bayes, we used continuous numerical values such as:

Height
Weight
Age
Salary

But in text classification, the data is different.

Example email:

"free free offer winner"

Here, we count word frequency:

Word	Count
free	2
offer	1
winner	1

So the features are not continuous values. They are word counts.
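As a minimal sketch of how such counts can be produced, the snippet below splits the example email on whitespace and counts each word. The variable names are illustrative and not part of the original lesson, and real pipelines usually apply more careful tokenization.

```python
from collections import Counter

email = "free free offer winner"

# Split on whitespace and count how often each word occurs.
word_counts = Counter(email.split())

for word, count in word_counts.items():
    print(word, count)
```

The resulting counts are exactly the kind of feature values that Multinomial Naive Bayes expects.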

For this type of data, we use:

Multinomial Naive Bayes

Core Idea

Multinomial Naive Bayes predicts the class by checking:

Which class is more likely to generate these word counts?

For example, if an email contains:

free offer winner

Multinomial Naive Bayes checks:

P(Spam | free, offer, winner)

and

P(Not Spam | free, offer, winner)

Then it chooses the class with the higher probability.

Bayes Theorem in Multinomial Naive Bayes

The basic formula is:

P(A|B) = P(B|A) P(A) / P(B)

For classification, we use:

P(Class | Words) ∝ P(Class) × P(word1 | Class) × P(word2 | Class) × ... × P(wordn | Class)

The denominator P(Words) is ignored during comparison because it is the same for all classes.
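To illustrate why dropping the denominator is safe, the sketch below normalizes two unnormalized class scores and checks that the winning class does not change. The score values are assumed for illustration (they match the second worked example later in this lesson), and the variable names are hypothetical.

```python
# Unnormalized scores: P(Class) * product of P(word | Class).
scores = {"Spam": 0.03, "Not Spam": 0.005}

# P(Words) is the sum of the unnormalized scores over all classes.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

# Dividing every score by the same positive number keeps the ranking intact.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Normalizing is only needed when actual probabilities are wanted, not for choosing the class.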

Example Dataset

Suppose we have the following training emails:

Email	Text	Class
E1	free offer	Spam
E2	free winner	Spam
E3	meeting schedule	Not Spam
E4	project meeting	Not Spam

We want to classify a new email:

free meeting

Step 1: Create Vocabulary

Vocabulary means the list of unique words in the training data.

Training texts:

free offer
free winner
meeting schedule
project meeting

Unique words:

free, offer, winner, meeting, schedule, project

So vocabulary size:

V = 6
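One possible way to build this vocabulary in code is shown below. It assumes simple whitespace tokenization, and the names training_texts and vocabulary are illustrative.

```python
training_texts = [
    "free offer",
    "free winner",
    "meeting schedule",
    "project meeting",
]

# Collect every distinct word; sorting gives a stable, repeatable order.
vocabulary = sorted({word for text in training_texts for word in text.split()})
V = len(vocabulary)

print(vocabulary)
print(V)
```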

Step 2: Count Words in Each Class

Spam Emails

Spam emails:

free offer
free winner

Word counts:

Word	Count in Spam
free	2
offer	1
winner	1
meeting	0
schedule	0
project	0

Total words in Spam:

2 + 1 + 1 = 4

Not Spam Emails

Not Spam emails:

meeting schedule
project meeting

Word counts:

Word	Count in Not Spam
free	0
offer	0
winner	0
meeting	2
schedule	1
project	1

Total words in Not Spam:

2 + 1 + 1 = 4
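The per-class counting above can be sketched as follows. This is a minimal illustration under the assumption of whitespace tokenization; training_data, word_counts, and total_words are hypothetical names.

```python
from collections import Counter, defaultdict

training_data = [
    ("free offer", "Spam"),
    ("free winner", "Spam"),
    ("meeting schedule", "Not Spam"),
    ("project meeting", "Not Spam"),
]

# One Counter per class; words never seen in a class count as 0.
word_counts = defaultdict(Counter)
for text, label in training_data:
    word_counts[label].update(text.split())

# Total number of word occurrences in each class.
total_words = {label: sum(counts.values()) for label, counts in word_counts.items()}
print(total_words)
```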

Step 3: Calculate Prior Probabilities

Total emails: 4

Spam emails: 2

Not Spam emails: 2

Therefore:

P(Spam) = 2/4 = 0.5

P(Not Spam) = 2/4 = 0.5
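The priors are simply class frequencies in the training labels. A short sketch, with illustrative variable names:

```python
from collections import Counter

labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# Prior of a class = number of emails in that class / total number of emails.
class_counts = Counter(labels)
priors = {label: count / len(labels) for label, count in class_counts.items()}
print(priors)
```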

Step 4: Problem of Zero Probability

New email:

free meeting

To calculate Spam probability:

P(Spam | free meeting) ∝ P(Spam) × P(free|Spam) × P(meeting|Spam)

But:

P(meeting|Spam) = 0

because the word meeting never appeared in Spam emails.

If any single probability is zero, the whole product becomes zero, no matter how strongly the other words point to that class.

This is a problem.
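The sketch below reproduces the issue with unsmoothed estimates (word count divided by total words in the class). For this particular email, each class has one unseen word, so both scores collapse to zero and no decision is possible. The function name unsmoothed_score is hypothetical.

```python
spam_counts = {"free": 2, "offer": 1, "winner": 1}
not_spam_counts = {"meeting": 2, "schedule": 1, "project": 1}
total_per_class = 4  # each class contains 4 word occurrences
prior = 0.5          # both classes have the same prior

def unsmoothed_score(counts, words):
    # Multiply the prior by count / total for each word, with no smoothing.
    score = prior
    for word in words:
        score *= counts.get(word, 0) / total_per_class
    return score

new_email = ["free", "meeting"]
spam_score = unsmoothed_score(spam_counts, new_email)
not_spam_score = unsmoothed_score(not_spam_counts, new_email)
print(spam_score, not_spam_score)
```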

To solve this, we use:

Laplace Smoothing

Step 5: Laplace Smoothing

Laplace smoothing adds 1 to every word count in every class, so no word probability can be zero.

Formula:

P(word | class) = (count(word in class) + 1) / (total words in class + vocabulary size)

Where:

Vocabulary size = 6
Total words in each class = 4

So denominator:

4 + 6 = 10
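The formula can be written as a small helper function. The alpha parameter is an addition for illustration: alpha = 1 gives the Laplace smoothing described here, and other positive values give the more general additive smoothing. The function name is hypothetical.

```python
def smoothed_likelihood(word_count, total_words_in_class, vocabulary_size, alpha=1.0):
    # (count + alpha) / (total words in class + alpha * vocabulary size)
    return (word_count + alpha) / (total_words_in_class + alpha * vocabulary_size)

# A word seen twice in the class, and a word never seen in the class.
seen_twice = smoothed_likelihood(2, 4, 6)
never_seen = smoothed_likelihood(0, 4, 6)
print(seen_twice, never_seen)
```

Even the unseen word now receives a small positive probability instead of zero.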

Step 6: Calculate Word Probabilities for Spam

P(free | Spam)

P(free|Spam) = (count(free in Spam) + 1) / (total Spam words + V)

= (2 + 1) / (4 + 6)

= 3 / 10

= 0.3

P(meeting | Spam)

P(meeting|Spam) = (0 + 1) / (4 + 6)

= 1 / 10

= 0.1

Step 7: Calculate Word Probabilities for Not Spam

P(free | Not Spam)

P(free|NotSpam) = (0 + 1) / (4 + 6)

= 1 / 10

= 0.1

P(meeting | Not Spam)

P(meeting|NotSpam) = (2 + 1) / (4 + 6)

= 3 / 10

= 0.3
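The same calculation can be repeated for every word in the vocabulary. The sketch below builds the full likelihood table for both classes and shows that each class's probabilities sum to 1; the names counts and likelihoods are illustrative.

```python
vocabulary = ["free", "offer", "winner", "meeting", "schedule", "project"]
counts = {
    "Spam": {"free": 2, "offer": 1, "winner": 1},
    "Not Spam": {"meeting": 2, "schedule": 1, "project": 1},
}

likelihoods = {}
for label, class_counts in counts.items():
    # Denominator: total words in the class + vocabulary size (Laplace smoothing).
    denominator = sum(class_counts.values()) + len(vocabulary)
    likelihoods[label] = {
        word: (class_counts.get(word, 0) + 1) / denominator for word in vocabulary
    }

for label, table in likelihoods.items():
    print(label, table, "sum =", round(sum(table.values()), 10))
```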

Step 8: Calculate Final Class Probabilities

New email:

free meeting

Spam Score

P(Spam | free meeting) ∝ P(Spam) × P(free|Spam) × P(meeting|Spam)

= 0.5 × 0.3 × 0.1

= 0.015

Not Spam Score

P(NotSpam | free meeting) ∝ P(NotSpam) × P(free|NotSpam) × P(meeting|NotSpam)

= 0.5 × 0.1 × 0.3

= 0.015

Step 9: Prediction

Class	Score
Spam	0.015
Not Spam	0.015
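The two scores can be checked with a few lines of code. This sketch hard-codes the smoothed likelihoods computed in the previous steps, and class_score is a hypothetical helper name.

```python
import math

prior = {"Spam": 0.5, "Not Spam": 0.5}
likelihood = {
    "Spam": {"free": 0.3, "meeting": 0.1},
    "Not Spam": {"free": 0.1, "meeting": 0.3},
}

def class_score(label, words):
    # Prior times the product of the word likelihoods for this class.
    score = prior[label]
    for word in words:
        score *= likelihood[label][word]
    return score

words = "free meeting".split()
spam_score = class_score("Spam", words)
not_spam_score = class_score("Not Spam", words)

# Floating-point products can differ in the last digits, so compare with a tolerance.
is_tie = math.isclose(spam_score, not_spam_score)
print(is_tie)
```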

Both scores are equal, so the classifier cannot prefer either class.

This happens because "free" is strongly related to Spam, while "meeting" is equally strongly related to Not Spam, so the two effects cancel out.

To see a clear prediction, let us classify another email.

Example 2: Classify "free winner"

New email:

free winner

Spam Score

P(Spam | free winner) ∝ P(Spam) × P(free|Spam) × P(winner|Spam)

P(free|Spam) = 0.3

P(winner|Spam) = (1 + 1) / 10 = 0.2

Spam Score = 0.5 × 0.3 × 0.2

= 0.03

Not Spam Score

P(NotSpam | free winner) ∝ P(NotSpam) × P(free|NotSpam) × P(winner|NotSpam)

P(free|NotSpam) = 0.1

P(winner|NotSpam) = (0 + 1) / 10 = 0.1

Not Spam Score = 0.5 × 0.1 × 0.1

= 0.005

Final Prediction

Class	Score
Spam	0.03
Not Spam	0.005

Since:

0.03 > 0.005

Prediction:

Spam

How Multinomial Naive Bayes Works

Training Text Data
        ↓
Create Vocabulary
        ↓
Count Word Frequencies per Class
        ↓
Calculate Prior Probabilities
        ↓
Apply Laplace Smoothing
        ↓
Calculate Word Likelihoods
        ↓
Multiply Probabilities
        ↓
Choose Class with Highest Score
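The whole flow can be sketched from scratch in a few lines. This is an illustrative implementation, not the one used by any library. Two choices here are assumptions of the sketch: it adds log-probabilities instead of multiplying probabilities, which is a common way to avoid numerical underflow on long documents, and it skips test words that are not in the training vocabulary. The function names train and predict are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(texts, labels, alpha=1.0):
    # Count words per class and emails per class.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label in class_counts:
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(class_counts[label] / len(labels)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model

def predict(model, text):
    scores = {}
    for label, params in model.items():
        # Sum of logs equals the log of the product of probabilities.
        scores[label] = params["log_prior"] + sum(
            params["log_likelihood"][word]
            for word in text.split()
            if word in params["log_likelihood"]  # skip unknown words (assumption)
        )
    return max(scores, key=scores.get), scores

texts = ["free offer", "free winner", "meeting schedule", "project meeting"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

model = train(texts, labels)
label, scores = predict(model, "free winner")
print(label)
```

Exponentiating the log scores recovers the hand-computed values for "free winner" (0.03 for Spam and 0.005 for Not Spam).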

Python Implementation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
emails = [
    "free offer",
    "free winner",
    "meeting schedule",
    "project meeting"
]

labels = [
    "Spam",
    "Spam",
    "Not Spam",
    "Not Spam"
]

# Convert text into word count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Create Multinomial Naive Bayes model
model = MultinomialNB()

# Train model
model.fit(X, labels)

# Test emails
test_emails = [
    "free meeting",
    "free winner"
]

X_test = vectorizer.transform(test_emails)

predictions = model.predict(X_test)

for email, prediction in zip(test_emails, predictions):
    print(email, "->", prediction)

Output:

free meeting -> Not Spam or Spam (the two scores tie, so the label depends on how ties are broken)
free winner -> Spam
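To compare the library's result with the hand calculation, the sketch below asks the model for class probabilities with predict_proba instead of only labels. It assumes scikit-learn is installed and passes alpha=1.0 explicitly, which corresponds to the Laplace smoothing used in the worked example. The variable spam_index is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["free offer", "free winner", "meeting schedule", "project meeting"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# alpha=1.0 is add-one (Laplace) smoothing.
model = MultinomialNB(alpha=1.0).fit(X, labels)

test_emails = ["free meeting", "free winner"]
probabilities = model.predict_proba(vectorizer.transform(test_emails))

# Columns of predict_proba follow the order of model.classes_.
spam_index = list(model.classes_).index("Spam")
for email, row in zip(test_emails, probabilities):
    print(email, "-> P(Spam) =", round(row[spam_index], 3))
```

These are normalized probabilities, so for "free winner" the Spam value corresponds to 0.03 / (0.03 + 0.005), and for "free meeting" both classes receive about 0.5.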

Advantages

Very effective for text classification
Works well with word count features
Fast training and prediction
Handles high-dimensional sparse data
Simple and easy to implement

Limitations

Works only with non-negative feature counts
Assumes word independence
Sensitive to rare words
Not suitable for continuous numerical features

Applications

Spam Detection
Sentiment Analysis
News Classification
Document Categorization
Language Detection
Topic Classification

Important Points

Multinomial Naive Bayes is used for count-based features.
It is widely used in text classification.
Features usually represent word frequencies.
It uses Bayes Theorem for classification.
Laplace smoothing avoids zero probability.
The class with the highest probability score is selected.
It works well with CountVectorizer counts; fractional weights such as TF-IDF can also work in practice, although the model is defined for counts.
It is not suitable for continuous numerical data.

Keywords

Multinomial Naive Bayes, Text Classification, Word Count Features, CountVectorizer, Spam Detection, Laplace Smoothing, Naive Bayes Classifier, Document Classification, Machine Learning NLP, Probabilistic Classifier
