Parameters and the Hypothesis Space
In Lesson 1 we said that ML searches for a function. But a search needs a search space — a clearly defined collection of candidates. Where do these candidate functions come from, and how many are there? The answer lies in the concept of a parameter, and in the resulting structure called the hypothesis space. Understanding both is essential to reasoning about what any model can and cannot learn.
Parameters: The Adjustable Knobs
A parameter is a numerical value inside a model that is adjusted during training. The model's architecture (its structural blueprint) stays fixed; only the parameter values change.
Consider the simplest possible model: a line in two dimensions. A line is defined by two parameters, slope m and intercept b, giving f(x) = mx + b. With m = 2 and b = 1, the function is f(x) = 2x + 1. Change m to −0.5 and b to 7, and you have an entirely different function, f(x) = −0.5x + 7. Same architecture (a line), different parameters, different function.
Modern large language models have billions of parameters; unofficial estimates put GPT-4 at roughly 1.8 trillion, a figure that has not been publicly confirmed. Each parameter is a single floating-point number. Training is the process of finding numerical values for all of them such that the resulting function performs well on the task. Parameters are sometimes called weights, especially in neural networks, because they weight the influence of each input on the output.
A parameter is a numerical value that is part of the model's function definition and is learned from data. The full set of parameter values completely specifies which function the model implements. Before training, parameters are typically initialized randomly; after training, they encode everything the model has learned.
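The idea of "same architecture, different parameters, different function" can be sketched in a few lines of Python. This is only an illustration: the helper name make_line is hypothetical and not part of the lesson. The two parameter settings are the ones used in the text.

```python
from typing import Callable


def make_line(m: float, b: float) -> Callable[[float], float]:
    """Return the function f(x) = m*x + b for fixed parameters m and b.

    make_line is a hypothetical helper used only for illustration.
    """
    def f(x: float) -> float:
        return m * x + b
    return f


# Same architecture (a line), two different parameter settings from the text.
f1 = make_line(2.0, 1.0)    # f(x) = 2x + 1
f2 = make_line(-0.5, 7.0)   # f(x) = -0.5x + 7

# Evaluate both functions on the same inputs to see that they differ.
for x in (0.0, 1.0, 4.0):
    print(x, f1(x), f2(x))
```

The architecture is the body of f, which never changes. Only the pair (m, b) passed to make_line selects which function you get.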
To make this concrete, consider a linear model for predicting apartment rent from three features: square footage (x1), number of bedrooms (x2), and distance to transit in miles (x3). The model is:
rent = w1 * x1 + w2 * x2 + w3 * x3 + b
Here w1, w2, w3, and b are four parameters. Suppose training yields w1 = 1.85, w2 = 220, w3 = −85, b = 400. Plugging in an apartment with 900 sq ft, 2 bedrooms, 0.4 miles from transit:
rent = 1.85(900) + 220(2) + (−85)(0.4) + 400
     = 1665 + 440 − 34 + 400
     = 2471 dollars per month
Every parameter carries meaning: w3 being negative says that being farther from transit lowers predicted rent, which aligns with real-world intuition. When you examine trained parameters, you are reading what the model has learned about the world.
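The rent calculation above can be reproduced with a short Python sketch. The function name predict_rent and its argument names are assumptions made for illustration; the parameter values are the ones the text supposes training produced.

```python
def predict_rent(sqft, bedrooms, miles_to_transit,
                 w1=1.85, w2=220.0, w3=-85.0, b=400.0):
    """Linear rent model: w1*x1 + w2*x2 + w3*x3 + b.

    Default parameter values are the hypothetical trained values from the text.
    """
    return w1 * sqft + w2 * bedrooms + w3 * miles_to_transit + b


# The apartment from the worked example: 900 sq ft, 2 bedrooms, 0.4 miles out.
rent = predict_rent(900, 2, 0.4)
print(round(rent, 2))  # about 2471 dollars per month

# Because w3 is negative, moving the same apartment farther from transit
# lowers the predicted rent.
farther = predict_rent(900, 2, 1.4)
print(round(farther, 2))
```

Passing different values for w1, w2, w3, and b gives a different function from the same hypothesis space, with no change to the code that defines the model.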
The Hypothesis Space: All Possible Functions
The hypothesis space H is the set of all functions the model's architecture can represent: one function for every possible assignment of values to its parameters. Because parameters are continuous real numbers, the hypothesis space of even a two-parameter linear model is infinite: there are infinitely many (m, b) pairs, so infinitely many lines.
Why does the shape of the hypothesis space matter? Because a model can only learn functions that lie inside its hypothesis space. A linear model (straight lines in 2D) cannot represent a parabola, no matter how cleverly you choose m and b. If the true function you are trying to approximate is nonlinear, a linear model is permanently constrained; the best it can do is a linear approximation.
This is why architecture choice is so important. Choosing an architecture means choosing the hypothesis space, that is, deciding which kinds of functions are even candidates for learning. A two-layer neural network with nonlinear activations can represent a far richer set of functions than a linear model. A convolutional neural network encodes the architectural prior that spatial relationships in images matter. The expressive power of a hypothesis space and the risk of overfitting to noise are fundamentally in tension, a theme that will recur throughout this module.
A larger hypothesis space means the model can fit more complex patterns — but it also means the model can fit noise and irrelevant quirks in the training data. Choosing an architecture with far more parameters than your problem requires is not neutral; it is a liability.
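The earlier claim that a linear model cannot represent a parabola can be checked numerically. The sketch below fits the best line, in the least-squares sense, to five points sampled from y = x^2 and measures the error that remains. Least squares is an assumption here, chosen as one common definition of "best"; the function names fit_line and sum_squared_error are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of f(x) = m*x + b to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def sum_squared_error(xs, ys, m, b):
    """Total squared gap between the line's predictions and the targets."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


# Five points sampled from the parabola y = x^2 (no noise at all).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)
sse = sum_squared_error(xs, ys, m, b)
print("best line: m =", m, "b =", b)
print("remaining squared error:", sse)
```

Even though the data contain no noise and the fit is optimal, the remaining error is strictly positive. The parabola lies outside the hypothesis space of lines, so no choice of (m, b) can reach it.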
A team switches from a linear model (2 parameters) to a polynomial model of degree 5 (6 parameters) for the same dataset. What happens to the hypothesis space?
A researcher trains a linear model on a dataset where the true relationship is a sine wave. Even with unlimited data and perfect optimization, what is the best possible outcome?
Explore the Hypothesis Space of a Line
- You will discover visually how parameter choices determine functions.
- Step 1: On graph paper (or a coordinate system drawn by hand), draw x from −5 to 5 and y from −10 to 10.
- Step 2: For each of the following (m, b) pairs, draw the line f(x) = mx + b:
- (a) m = 1, b = 0; (b) m = 2, b = −3; (c) m = −1, b = 5; (d) m = 0, b = 2
- Step 3: You have just drawn four hypotheses from the hypothesis space of linear models. Can you draw a hypothesis from this space that passes through the points (0, 0) and (2, 4)? Find the parameters.
- Step 4: Can you draw a hypothesis from this space that passes through (0,0), (1,1), and (2,3) simultaneously? Why or why not?
- Discuss: What does your answer to Step 4 tell you about the limits of a hypothesis space?
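After working the activity by hand, you can check Steps 3 and 4 with a small Python sketch. The helper names line_through and on_line are hypothetical. The code solves for the (m, b) that passes through two points and then tests whether a third point lies on the same line.

```python
def line_through(p, q):
    """Return (m, b) for the line f(x) = m*x + b through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        # A vertical line has no finite slope, so it is outside this space.
        raise ValueError("vertical line: not representable as f(x) = m*x + b")
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b


def on_line(point, m, b, tol=1e-9):
    """Check whether a point satisfies y = m*x + b, up to a small tolerance."""
    x, y = point
    return abs(m * x + b - y) <= tol


# Step 3: two points always determine one hypothesis (if their x values differ).
m, b = line_through((0, 0), (2, 4))
print("Step 3 parameters:", m, b)

# Step 4: fit the first two points, then test the third.
m2, b2 = line_through((0, 0), (1, 1))
print("Step 4, third point on the line?", on_line((2, 3), m2, b2))
```

Two parameters can be solved to satisfy two constraints, but a third point is matched only if it happens to fall on the same line. This is the limit of the hypothesis space that the Discuss prompt asks about.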