Is HYVE CARES really free?

Yes. 100% free, forever. Every feature, every lab, every lesson. The only paid add-on is the optional Homeschool Compliance Program ($10/month) for families who need legal compliance tools.

Can I use HYVE CARES for homeschooling?

Yes. HYVE CARES provides a complete K-12 curriculum plus a dedicated Homeschool Compliance Program with attendance tracking, immunization records, standardized test management, and transcript generation — available in all 50 US states.

What subjects does HYVE CARES cover?

200+ subjects including Math, Science, Language Arts, Social Studies, Coding, 18 world languages, Financial Literacy, Music, Art, Career Readiness, and more — aligned with Common Core and NGSS standards.

Does HYVE CARES have practice exams?

Yes. 30+ practice exams including SAT, ACT, GRE, LSAT, MCAT, ASVAB, CompTIA A+, Real Estate, CDL, and more — with timed testing, AI-powered scoring, percentile estimates, and spaced repetition study mode.

MaXXiE is HYVE CARES' AI tutoring system — a personalized learning companion that adapts to each student, generates lessons on demand, scans homework, and provides voice-based learning.

Is HYVE CARES safe for children?

Yes. HYVE CARES requires parental consent for children under 13 (in line with COPPA), stores student data with Row-Level Security and AES-256 encryption at rest, and never sells data or shows ads.

Backpropagation

Gradient descent needs the gradient dL/dw for every weight w in the network. In a deep network with millions of weights, computing these gradients by brute force — perturbing each weight individually and measuring the change in loss — would be impossibly slow. Backpropagation is the algorithm that computes all gradients simultaneously and efficiently, in one backward pass through the network. It rests entirely on the chain rule of calculus. This lesson derives it carefully, with a worked numerical example.

The Chain Rule, Reviewed

The chain rule states: if y depends on u, and u depends on x, then: dy/dx = (dy/du) * (du/dx) For a longer chain — y depends on u, u depends on v, v depends on x: dy/dx = (dy/du) * (du/dv) * (dv/dx) This is the engine of backpropagation. The loss L depends on the output of the last layer, which depends on the previous layer, which depends on the one before — a long chain of functions. Backpropagation applies the chain rule to decompose dL/dw^(l) into a product of local derivatives at each layer, computed efficiently.

The Delta (Error Signal)

Define delta^(l) as the gradient of the loss with respect to the pre-activation z^(l) at layer l. It is called the error signal or local gradient. Backpropagation works by computing delta^(L) at the output layer first, then propagating it backward: delta^(l) is computed from delta^(l+1) times the weights W^(l+1) times the local derivative of the activation function. Once you have delta^(l), the gradient of L with respect to W^(l) is simply the outer product of delta^(l) and h^(l-1).

Full derivation for a two-layer network. Setup: input x, hidden layer h = sigma(z^(1)), output y_hat = sigma(z^(2)), loss L = (y - y_hat)^2. Step 1: Forward pass (already know how to do this). z^(1) = W^(1) x + b^(1); h = sigma(z^(1)) z^(2) = W^(2) h + b^(2); y_hat = sigma(z^(2)) L = (y - y_hat)^2 Step 2: Output layer error signal (delta^(2)). dL/dy_hat = -2(y - y_hat) dy_hat/dz^(2) = sigma'(z^(2)) delta^(2) = dL/dz^(2) = (dL/dy_hat)(dy_hat/dz^(2)) = -2(y - y_hat) * sigma'(z^(2)) Step 3: Gradient with respect to W^(2). dL/dW^(2) = delta^(2) * h^T [outer product] Step 4: Backpropagate to layer 1. delta^(1) = (W^(2))^T delta^(2) * sigma'(z^(1)) [element-wise multiply by local activation derivative] Step 5: Gradient with respect to W^(1). dL/dW^(1) = delta^(1) * x^T Numerical example (continuing from Lesson 4 values, scalar output for simplicity): y = 1 (true label), y_hat = 0.670 (network output), sigma = sigmoid. Step 2: dL/dy_hat = -2(1 - 0.670) = -2(0.330) = -0.660 sigma'(z^(3)) = y_hat*(1-y_hat) = 0.670*0.330 = 0.221 delta^(3) = -0.660 * 0.221 = -0.146 Step 3: dL/dW^(3) = delta^(3) * h^(2)^T = -0.146 * [0.6, 0.14] = [-0.0876, -0.0204] With eta = 0.1, the update for W^(3) = [1.2, -0.8] becomes: W^(3)_new = [1.2 - 0.1*(-0.0876), -0.8 - 0.1*(-0.0204)] = [1.2 + 0.00876, -0.8 + 0.00204] = [1.20876, -0.79796] The weights shifted slightly toward values that increase the output — correctly, since we want the output closer to 1.

Why Backpropagation Is Efficient

Without backpropagation, computing dL/dw_i for each weight w_i would require re-running the entire forward pass twice (once with w_i, once with w_i + epsilon) — two forward passes per weight. With a million weights, that is two million forward passes per gradient step. Backpropagation computes all gradients in one backward pass, which costs roughly the same as one forward pass. This is a million-fold speedup in a million-parameter network. The reason is reuse. When backpropagating, each intermediate delta^(l) is computed once and reused for all weights in that layer. The chain rule products are accumulated rather than recomputed from scratch for each weight. Backpropagation was introduced in the context of neural networks by Rumelhart, Hinton, and Williams in 1986, in a landmark Nature paper. It made training deep networks computationally feasible and is still the algorithm running inside every modern deep learning framework.

Match each backpropagation step to what it computes.

Terms

Forward pass

Output error delta^(L)

Backpropagation to layer l

Gradient of W^(l)

Weight update

Definitions

Computes predictions and all intermediate activations h^(l)

Gradient of loss with respect to the pre-activation at the last layer

Outer product of delta^(l) and the activations from the previous layer h^(l-1)

Multiplies delta^(l+1) by W^(l+1) transposed and scales by activation derivative

Subtracts learning rate times gradient from each weight after all gradients are collected

Drag terms onto their definitions, or click a term then click a definition to match.

Vanishing Gradients Revisited

Backpropagation multiplies many terms together via the chain rule. If the activation derivative sigma'(z) is small (as it is in saturation regions of sigmoid or tanh), then delta^(l) shrinks exponentially as l decreases. By the time the gradient reaches early layers, it may be so small that those layers barely update. This is the vanishing gradient problem from the activation function perspective — and it explains again why ReLU, with its gradient of 1 for positive activations, dramatically helps deep networks train.

During backpropagation, what does multiplying delta^(l+1) by (W^(l+1))^T achieve?

A deep network has 50 layers all using sigmoid activations. The derivative of sigmoid is at most 0.25. What happens to the gradient at layer 1, starting from a gradient of 1 at layer 50?

Backpropagation by Hand

Step 1: Build the simplest possible network: one weight w, one bias b = 0, no hidden layer. The network computes y_hat = w * x.
Step 2: Use x = 3, y = 6 (true value), and start with w = 1. Loss L = (y - y_hat)^2.
Step 3: Compute y_hat = 1 * 3 = 3. Compute L = (6-3)^2 = 9.
Step 4: Compute dL/dy_hat = -2(6-3) = -6.
Step 5: Compute dy_hat/dw = x = 3 (since y_hat = w*x, the derivative with respect to w is x).
Step 6: By chain rule: dL/dw = (dL/dy_hat)*(dy_hat/dw) = (-6)*(3) = -18.
Step 7: Update: w_new = w - eta * (dL/dw) = 1 - 0.1*(-18) = 1 + 1.8 = 2.8.
Step 8: Compute y_hat_new = 2.8 * 3 = 8.4. Loss_new = (6-8.4)^2 = 5.76. Did loss decrease?
Step 9: Continue for two more steps. What value does w approach?