CountVectorizer in NLP

Introduction

When working with text data in Machine Learning or NLP (Natural Language Processing), one of the first challenges is:

How can we convert text into numbers that a machine can understand?

Computers cannot understand sentences like:

"I love Python"
"Python is easy"
"I love coding"

Machine learning algorithms work with numbers, not text.

This is where CountVectorizer comes into the picture.

CountVectorizer converts text documents into numerical vectors by counting how many times each word appears.

It is available in the sklearn.feature_extraction.text module of the scikit-learn library.

What is CountVectorizer?

Definition:

CountVectorizer converts a collection of text documents into a matrix of token counts.

In simple words:

  • Extract all unique words

  • Create a vocabulary

  • Count how many times each word appears

  • Represent each document as numerical data

Why Do We Need CountVectorizer?

Consider the following dataset:

documents = [
"I love Python",
"Python is easy",
"I love coding"
]

Machine learning algorithms cannot process these strings directly.

We need something like:

coding  easy  is  love  python
0       0     0   1     1
0       1     1   0     1
1       0     0   1     0

Now the text has been converted into numbers.

This numerical representation can be used for:

  • Sentiment Analysis

  • Spam Detection

  • Text Classification

  • Recommendation Systems

  • Chatbots

  • Search Engines

How CountVectorizer Works

CountVectorizer works in three major steps:

Step 1: Tokenization

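Break text into individual words (tokens). With its default settings, CountVectorizer also lowercases the text and keeps only tokens of two or more characters, so the single-letter word "I" is dropped.

Example:

I love Python

becomes:

["love", "python"]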
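
As a minimal sketch of this step, the analyzer that CountVectorizer uses internally can be obtained with build_analyzer() and called on a single string. With the default settings it lowercases the text and keeps only tokens of two or more characters, which is why "I" does not appear in the result. The second vectorizer, with a custom token_pattern, is an illustrative assumption (not part of the example above) showing one way to keep single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer: lowercasing plus the token pattern r"(?u)\b\w\w+\b"
default_analyzer = CountVectorizer().build_analyzer()
default_tokens = default_analyzer("I love Python")
print(default_tokens)  # ['love', 'python']

# Illustrative variation: accept tokens of one or more characters
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
short_tokens = keep_short("I love Python")
print(short_tokens)  # ['i', 'love', 'python']
```

The analyzer can be called without fitting the vectorizer, which makes it a convenient way to check how a given configuration splits text.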

Step 2: Create Vocabulary

Collect all unique words from all documents.

Documents:

[
"I love Python",
"Python is easy",
"I love coding"
]

Unique words (lowercased and sorted alphabetically; the single-character token "I" is dropped by default):

coding
easy
is
love
python

Vocabulary:

Word    Index
coding  0
easy    1
is      2
love    3
python  4
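
The vocabulary step can be sketched in plain Python to show what happens conceptually. This is not scikit-learn's actual implementation: the tokenize helper and TOKEN_PATTERN are hypothetical stand-ins that assume the default behaviour (lowercasing, tokens of two or more characters).

```python
import re

documents = [
    "I love Python",
    "Python is easy",
    "I love coding",
]

# Assumed stand-in for the default tokenizer: lowercase the text,
# then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# Collect every distinct token across all documents
unique_words = set()
for doc in documents:
    unique_words.update(tokenize(doc))

# Sort alphabetically and assign each word its column index
vocabulary = {word: index for index, word in enumerate(sorted(unique_words))}
print(vocabulary)
# {'coding': 0, 'easy': 1, 'is': 2, 'love': 3, 'python': 4}
```

Sorting before assigning indices is what makes the column order alphabetical rather than dependent on the order in which words are first seen.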

Step 3: Count Frequencies

Now CountVectorizer counts occurrences of each word in every document.

Document 1:

I love Python

Counts:

coding = 0
easy = 0
is = 0
love = 1
python = 1

Vector:

[0, 0, 0, 1, 1]

Document 2:

Python is easy

Vector:

[0, 1, 1, 0, 1]

Document 3:

I love coding

Vector:

[1, 0, 0, 1, 0]

Final Matrix:

Document        coding  easy  is  love  python
I love Python   0       0     0   1     1
Python is easy  0       1     1   0     1
I love coding   1       0     0   1     0

This is called a Document-Term Matrix (DTM).
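
The counting step can likewise be sketched in plain Python. The vocabulary is the one listed in Step 2; tokenize and count_vector are hypothetical helpers written for illustration, assuming the default tokenization rules, and do not reflect how scikit-learn is implemented internally.

```python
import re
from collections import Counter

documents = [
    "I love Python",
    "Python is easy",
    "I love coding",
]

vocabulary = {"coding": 0, "easy": 1, "is": 2, "love": 3, "python": 4}

def tokenize(text):
    # Assumed stand-in for the default tokenizer
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vector(text, vocabulary):
    # Count each token, then place the counts at the word's column index
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

# One row per document: the document-term matrix
matrix = [count_vector(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
# [0, 0, 0, 1, 1]
# [0, 1, 1, 0, 1]
# [1, 0, 0, 1, 0]

# A repeated word is counted, and a word outside the vocabulary is ignored
repeated = count_vector("Python loves Python", vocabulary)
print(repeated)  # [0, 0, 0, 0, 2]
```

The last call shows that the values are counts rather than simple presence flags: "python" appears twice, so its column holds 2.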

Python Program

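from sklearn.feature_extraction.text import CountVectorizer

# Three short example documents
documents = [
    "I love Python",
    "Python is easy",
    "I love coding"
]

# Default settings: lowercase the text and keep tokens of 2+ characters
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix in one step
X = vectorizer.fit_transform(documents)

print("Vocabulary:")
print(vectorizer.vocabulary_)  # word -> column index

print("\nFeatures:")
print(vectorizer.get_feature_names_out())  # words in column order

print("\nCount Matrix:")
print(X.toarray())  # convert the sparse matrix to a dense array for display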

Output:

Vocabulary:
{'love': 3, 'python': 4, 'is': 2, 'easy': 1, 'coding': 0}

Features:
['coding' 'easy' 'is' 'love' 'python']

Count Matrix:
[[0 0 0 1 1]
 [0 1 1 0 1]
 [1 0 0 1 0]]
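
Once fitted, the same vectorizer can convert new text using the vocabulary it has already learned. The sketch below uses a made-up sentence (an assumption for illustration); the word "really" was not seen during fitting, so it is simply ignored, while "Python" appears twice and is counted twice.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love Python",
    "Python is easy",
    "I love coding",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# fit_transform returns a SciPy sparse matrix:
# one row per document, one column per vocabulary word
print(X.shape)  # (3, 5)

# Hypothetical new sentence; transform() reuses the fitted vocabulary
new_counts = vectorizer.transform(["Python is really easy, I love Python"])
print(new_counts.toarray())  # [[0 1 1 1 2]]
```

Using transform() rather than fit_transform() on new text keeps the columns aligned with the matrix built from the original documents.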
