K-Nearest Neighbors
Brief Idea
K-Nearest Neighbors, commonly called KNN, is a supervised machine learning algorithm used for classification and regression.
In classification, KNN predicts the class of a new data point by looking at the classes of its nearest neighbors.
The main idea is:
Similar data points usually belong to similar classes.
Why KNN is Called a Non-Linear Classifier
KNN does not create one fixed straight-line decision boundary like Logistic Regression or Linear SVM.
Instead, it classifies points based on nearby training examples.
Because of this, KNN can create:
Curved boundaries
Irregular boundaries
Complex boundaries
So KNN is considered a non-linear classifier.
Intuition
Suppose you move to a new city and want to know whether an area is expensive or affordable.
You check nearby houses:
Nearest 5 houses
↓
Most are expensive
↓
Your area is likely expensive
KNN works in the same way.
For a new data point:
Find nearest K points
↓
Check their class labels
↓
Choose majority class
Meaning of K in KNN
In KNN, K means the number of nearest neighbors considered for prediction.
Example:
K = 3
means:
Look at 3 nearest training points
If among 3 neighbors:
2 belong to Class A
1 belongs to Class B
Then prediction:
Class A
How KNN Works
Step 1: Store Training Data
KNN does not build a complex model during training.
It simply stores the training data.
That is why KNN is called a:
Lazy Learning Algorithm
Step 2: Take a New Data Point
Suppose a new point needs classification.
Step 3: Calculate Distance
KNN calculates the distance between the new point and all training points.
Common distance metric:
Euclidean Distance
Formula:
Distance = √((x2 - x1)² + (y2 - y1)²)
Step 4: Select K Nearest Neighbors
Sort all distances from smallest to largest.
Select the nearest K points.
Step 5: Majority Voting
Count the class labels among selected K neighbors.
The class with the highest count becomes the prediction.
Example Dataset
| Point | x1 | x2 | Class |
|---|---|---|---|
| A | 1 | 1 | Red |
| B | 2 | 1 | Red |
| C | 1 | 2 | Red |
| D | 5 | 5 | Blue |
| E | 6 | 5 | Blue |
| F | 5 | 6 | Blue |
New point:
P = (3,3)
Let:
K = 3
We need to classify P.
Step 1: Calculate Distance from P to Each Point
Distance formula:
Distance = √((x2 - x1)² + (y2 - y1)²)
Distance from P(3,3) to A(1,1)
Distance = √((3 - 1)² + (3 - 1)²)
= √(2² + 2²)
= √8
= 2.828
Distance from P(3,3) to B(2,1)
Distance = √((3 - 2)² + (3 - 1)²)
= √(1² + 2²)
= √5
= 2.236
Distance from P(3,3) to C(1,2)
Distance = √((3 - 1)² + (3 - 2)²)
= √(2² + 1²)
= √5
= 2.236
Distance from P(3,3) to D(5,5)
Distance = √((3 - 5)² + (3 - 5)²)
= √((-2)² + (-2)²)
= √8
= 2.828
Distance from P(3,3) to E(6,5)
Distance = √((3 - 6)² + (3 - 5)²)
= √((-3)² + (-2)²)
= √13
= 3.606
Distance from P(3,3) to F(5,6)
Distance = √((3 - 5)² + (3 - 6)²)
= √((-2)² + (-3)²)
= √13
= 3.606
Step 2: Sort Distances
| Point | Distance | Class |
|---|---|---|
| B | 2.236 | Red |
| C | 2.236 | Red |
| A | 2.828 | Red |
| D | 2.828 | Blue |
| E | 3.606 | Blue |
| F | 3.606 | Blue |
Step 3: Select K Nearest Neighbors
For:
K = 3
Nearest 3 points are:
| Neighbor | Class |
|---|---|
| B | Red |
| C | Red |
| A | Red |
Step 4: Majority Voting
Among 3 nearest neighbors:
Red = 3
Blue = 0
So prediction:
P(3,3) = Red
How K Value Affects Prediction
Small K
Example:
K = 1
The model only checks the nearest single point.
This can be sensitive to noise.
Problem:
High variance
Overfitting
Large K
Example:
K = 15
The model considers many points.
This can smooth the boundary too much.
Problem:
High bias
Underfitting
Good K
A balanced K gives better classification.
Common practice:
Use odd K values
Examples:
K = 3, 5, 7
Odd K helps avoid ties in binary classification.
Distance Metrics in KNN
1. Euclidean Distance
Most common distance metric.
Used for continuous numerical data.
Straight-line distance between two points
2. Manhattan Distance
Distance measured by moving only horizontally and vertically.
Formula:
|x2 - x1| + |y2 - y1|
Useful in grid-like data.
3. Minkowski Distance
General form of Euclidean and Manhattan distance.
4. Hamming Distance
Used for categorical or binary features.
Why Feature Scaling is Important
KNN depends on distance.
If one feature has very large values, it dominates distance calculation.
Example:
| Feature | Range |
|---|---|
| Age | 18 to 60 |
| Salary | 20,000 to 2,00,000 |
Salary values are much larger, so KNN may ignore age.
To solve this, use:
Standardization
Normalization
KNN Decision Boundary
KNN creates decision boundaries based on local neighborhoods.
Small K:
Complex boundary
Large K:
Smoother boundary
This is why KNN can handle non-linear classification problems.
Advantages
-
Simple to understand
-
No training phase required
-
Works well for small datasets
-
Can handle non-linear boundaries
-
Useful for multi-class classification
Limitations
-
Slow during prediction
-
Sensitive to irrelevant features
-
Sensitive to feature scaling
-
Memory intensive
-
Choosing correct K is important
-
Performs poorly with high-dimensional data
Applications
-
Recommendation systems
-
Image classification
-
Pattern recognition
-
Medical diagnosis
-
Customer segmentation
-
Handwriting recognition
Python Implementation
from sklearn.neighbors import KNeighborsClassifier
X = [
[1, 1],
[2, 1],
[1, 2],
[5, 5],
[6, 5],
[5, 6]
]
y = ["Red", "Red", "Red", "Blue", "Blue", "Blue"]
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
prediction = model.predict([[3, 3]])
print("Prediction:", prediction)
Output:
Prediction: ['Red']
Important Points
-
KNN stands for K-Nearest Neighbors.
-
KNN is a supervised learning algorithm.
-
KNN is used for classification and regression.
-
It is called a lazy learning algorithm.
-
KNN stores training data instead of learning parameters.
-
Prediction is based on nearest neighbors.
-
K value controls how many neighbors are considered.
-
Majority voting is used for classification.
-
KNN can create non-linear decision boundaries.
-
Feature scaling is very important in KNN.
Keywords
K Nearest Neighbors, KNN Classifier, Non Linear Classifier, Lazy Learning Algorithm, Euclidean Distance, Majority Voting, K Value, Distance Based Algorithm, KNN Decision Boundary, Machine Learning Classification