Dimensionality Reduction
Real-world datasets are high-dimensional. A genome sequencing experiment might track 20,000 gene expression levels per patient. An image is a vector of millions of pixel values. A bag-of-words text representation might have 100,000 features — one per word in the vocabulary. These high-dimensional spaces are not just large; they behave in counterintuitive ways that make learning harder. Dimensionality reduction is the set of techniques that compress those many features into far fewer while preserving what matters.
The Curse of Dimensionality
As the number of features (dimensions) grows, data becomes exponentially sparser. Consider trying to cover a line with evenly spaced points — 10 points cover a 1D line reasonably well. To cover a 10-dimensional hypercube at the same density, you need 10¹⁰ points. Most machine learning algorithms implicitly assume that nearby points in feature space are similar; sparsity breaks this assumption because 'nearby' effectively ceases to exist — in high dimensions, all points are roughly equidistant from each other. This has practical consequences: - Models have more parameters to fit, requiring more data. - Distance-based algorithms (k-means, k-nearest-neighbors) lose discriminative power. - Visualizing data becomes impossible beyond 3 dimensions. - Training is slower and more memory-intensive. Dimensionality reduction attacks all these problems at once by replacing d original features with k new features where k << d.
Most real datasets do not actually have as many truly independent dimensions as they have features. A thousand photos of faces vary in lighting, pose, expression, and identity — but those variations live on a low-dimensional manifold inside the million-dimensional pixel space. Dimensionality reduction finds and exploits that lower-dimensional structure. The true number of independent degrees of variation is called the intrinsic dimensionality.
Principal Component Analysis (PCA) — the most foundational dimensionality reduction algorithm: PCA finds the directions (principal components) in the original feature space along which the data varies the most, then projects the data onto those directions. Formal steps: 1. Center the data: subtract the mean from every feature so the data is centered at the origin. 2. Compute the covariance matrix: a d×d matrix where entry (i,j) measures how much features i and j vary together. 3. Find eigenvectors and eigenvalues: the eigenvectors of the covariance matrix are the principal components. The eigenvalue associated with each eigenvector indicates how much variance lies in that direction. 4. Sort by eigenvalue: the first principal component (PC1) explains the most variance; PC2 explains the second most, orthogonal to PC1; and so on. 5. Project: multiply the centered data by the top-k eigenvectors to get a k-dimensional representation. Concrete 2D → 1D example: 5 points — (2,4), (3,6), (4,8), (5,10), (6,12). These points lie perfectly on the line y=2x. PCA with k=1 would find that the first principal component runs along that diagonal, and all five points project onto it with zero reconstruction error. The entire dataset is captured in one dimension because the intrinsic dimensionality is 1.
Choosing how many components: A scree plot graphs eigenvalues in descending order. A steep drop followed by a leveling-off (the 'elbow') suggests where to cut. Another approach: choose k such that the top-k components explain a target fraction of total variance (e.g., 95%). If the first two components explain 90% of variance in a 1,000-dimensional dataset, k=2 compresses by a factor of 500 with only 10% of variation lost. Other dimensionality reduction methods: - t-SNE (t-distributed Stochastic Neighbor Embedding): a non-linear technique designed for visualization (k=2 or 3). Preserves local neighborhood structure very well, but distances between distant clusters are not meaningful — t-SNE is for visualization, not for feeding into downstream classifiers. - UMAP (Uniform Manifold Approximation and Projection): faster than t-SNE, better at preserving both local and global structure, increasingly preferred for exploratory analysis. - Autoencoders: neural networks trained to compress inputs into a bottleneck layer and then reconstruct them. The bottleneck activations are the reduced representation.
Complete the following statements about PCA.
Why Dimensionality Reduction Helps Downstream Learning
Dimensionality reduction is not just about compression — it actively improves downstream tasks: 1. Noise removal: small eigenvalues often correspond to noise rather than signal. Dropping those components removes noise before clustering or classification. 2. Visualization and hypothesis generation: projecting to 2D lets researchers visually inspect whether groups naturally separate — discovering structure before running formal clustering. 3. Computational efficiency: training a classifier on 20 features instead of 20,000 is vastly faster and uses far less memory. 4. Avoiding the curse of dimensionality: sparse high-dimensional space becomes dense low-dimensional space, restoring the assumption that proximity implies similarity. Important caveat: the new dimensions produced by PCA and t-SNE are not the original features — they are linear (or nonlinear) combinations. Interpretability can be lost. A principal component might correspond to 'gene expression patterns that co-vary across patients' rather than any single gene — meaningful, but not immediately readable.
t-SNE is extraordinarily good at revealing cluster structure visually. However, the distances between clusters in a t-SNE plot do not reflect true similarity in the original space, and cluster sizes are also distorted. Never use t-SNE output distances for downstream analysis. Use it only for qualitative visual exploration.
A genomics researcher applies PCA to a 20,000-gene expression dataset and finds that the first 5 principal components explain 87% of total variance. What does this tell her?
Why should you NOT feed t-SNE embeddings as features into a downstream classifier?
Variance Budget
- Step 1: Consider a toy dataset: 100 students, each described by 8 features (height, weight, GPA, hours studied per week, commute time, number of siblings, shoe size, resting heart rate).
- Step 2: Without running any algorithm, predict which features you think will be highly correlated with each other. List at least two pairs.
- Step 3: Hypothesize which features will make up the first principal component — which combination of features explains the most variation between students?
- Step 4: If PCA reveals that the first two components explain 70% of variance, what would you conclude about the intrinsic dimensionality of student variation?
- Step 5: If you then cluster students in the 2D PCA space and find three distinct groups, what steps would you take to validate whether those groups are meaningful?