k-Nearest Neighbors
Suppose you are new to a city and trying to decide whether a neighborhood restaurant is worth visiting. You poll the five people whose opinions you trust most who live nearest to you. Four of them love it; one is indifferent. You decide to go. This is, essentially, k-Nearest Neighbors: find the k training examples most similar to the new input, and let their labels determine the prediction. It is perhaps the most conceptually transparent algorithm in supervised learning — and an ideal lens through which to examine what 'similarity' really means.
The Algorithm
k-NN is a lazy learning algorithm — it does no explicit training phase. Instead, it stores the entire training dataset. When a new input x arrives, it: 1. Computes the distance from x to every training point. 2. Identifies the k closest training points (the k nearest neighbors). 3. For classification: predicts the majority class among the k neighbors. For regression: predicts the mean label among the k neighbors. A concrete classification example: Training set (two features: hours_studied, hours_slept): Student A: (6, 8) → Passed Student B: (2, 5) → Failed Student C: (5, 7) → Passed Student D: (1, 6) → Failed Student E: (4, 8) → Passed New student: x = (4.5, 7.5). Use k=3. Compute Euclidean distances: d(x, A) = √((6-4.5)² + (8-7.5)²) = √(2.25 + 0.25) = √2.5 ≈ 1.58 d(x, B) = √((2-4.5)² + (5-7.5)²) = √(6.25 + 6.25) = √12.5 ≈ 3.54 d(x, C) = √((5-4.5)² + (7-7.5)²) = √(0.25 + 0.25) = √0.5 ≈ 0.71 d(x, D) = √((1-4.5)² + (6-7.5)²) = √(12.25 + 2.25) = √14.5 ≈ 3.81 d(x, E) = √((4-4.5)² + (8-7.5)²) = √(0.25 + 0.25) = √0.5 ≈ 0.71 Three nearest: C (0.71), E (0.71), A (1.58). Labels: Passed, Passed, Passed. Prediction: Passed (3-0 majority).
k-NN has almost no mathematical assumptions about the shape of the decision boundary. The assumption is entirely in the distance metric: it assumes that similar inputs (close in distance) tend to have similar labels. If that assumption holds, k-NN works. If the feature space has many irrelevant dimensions, or features at very different scales, that assumption breaks — and so does the algorithm.
The choice of k is critical and controls the bias-variance trade-off: k=1: The prediction equals the label of the single nearest neighbor. The decision boundary is extremely jagged — it will perfectly classify every training point (zero training error) but will be very sensitive to noise. High variance, low bias. Large k (e.g., k=50): The prediction is the majority among 50 neighbors, which smooths out noise but may blur true class boundaries. Low variance, higher bias. Practical guidance: try k values from 1 to ~20 using cross-validation and select the one that minimizes validation error. k should be odd for binary classification to avoid ties. Distance metrics matter enormously: Euclidean distance: d(x, z) = √Σ(xⱼ - zⱼ)². The most common choice, appropriate when features are on comparable scales and directions are equally meaningful. Manhattan distance: d(x, z) = Σ|xⱼ - zⱼ|. Sums absolute differences; less sensitive to large outliers in single dimensions. Cosine similarity: measures the angle between feature vectors. Common in text where direction (which words appear) matters more than magnitude (how many times).
If feature x₁ ranges from 0 to 1 and feature x₂ ranges from 0 to 10,000, Euclidean distance is almost entirely determined by x₂. A feature you believe is important but happens to have small numerical range will be ignored. Always standardize features before applying k-NN: subtract the mean and divide by the standard deviation so all features contribute comparably.
Complete the statement about k-NN classification.
Computational Cost and the Curse of Dimensionality
k-NN's practical limitation is computational. With N training points and D features, each prediction requires computing N distances, each involving D operations — a cost of O(N × D) per query. With millions of training examples and thousands of features, this is prohibitively slow. Approximate nearest-neighbor structures (KD-trees, ball trees) can reduce query time in low dimensions, but these data structures degrade as dimensionality increases. The curse of dimensionality is a deeper problem. In high-dimensional spaces, all points tend to be roughly the same distance from each other. Intuition: in 1D, the k nearest neighbors to a point are genuinely close. In 1,000 dimensions, the 'nearest' neighbor might be nearly as far away as the 'farthest' — the notion of locality collapses. The distances become uninformative, and k-NN stops working well. This is why k-NN is most effective with: - Datasets with fewer than a few thousand features - Features that have been carefully selected and scaled - Datasets small enough that O(N) queries are manageable - Problems where the true decision boundary is complex and non-parametric, but the relevant dimensions are few
A k-NN regression model with k=5 is used to predict house prices. The five nearest neighbors have prices: $280k, $310k, $295k, $400k, $275k. What is the prediction?
Why must features be standardized before using k-NN, but NOT before using a decision tree?
The Curse of Dimensionality Simulation
- This unplugged activity builds intuition for why high-dimensional spaces are strange.
- Step 1 (1D): On a number line from 0 to 10, mark 5 random points. Pick a query point and identify the 2 nearest. How far away is the nearest? Is 'nearest' meaningful?
- Step 2 (2D): On a 10x10 grid, mark 5 random points. Pick a query point. Compute Euclidean distances to all 5 points and find the 2 nearest. Compare the distance to the nearest vs. the farthest.
- Step 3 (thought experiment — 100D): Imagine 5 points scattered in a 100-dimensional cube of side length 10. Each coordinate is random between 0 and 10. Estimate the expected Euclidean distance between two random points using the formula: E[d] ≈ 10 × √(D/3) ≈ 10 × √33 ≈ 57.7.
- Step 4: Compare the expected distances in 1D, 2D, and 100D. As D grows, what happens to the ratio of (nearest neighbor distance) to (farthest neighbor distance)?
- Step 5: Discuss — if all points are nearly equally far away, what does that mean for a k-NN classifier trying to find 'similar' examples?