Chain Rule
Anyone who stumbles across this and has some free time -- the writing style is very informal, uses the first person, etc., and should be cleaned up.
Statement
The Chain Rule is a theorem of calculus which states that if , then wherever those expressions make sense.
For example, if , , and , then .
Here are some more precise statements for the single-variable and multi-variable cases.
Single variable Chain Rule:
Let each of be an open interval, and suppose and . Let such that . If , is differentiable at , and is differentiable at then is differentiable at , and .
Multi-dimensional Chain Rule:
Let and . (Here each of , , and is a positive integer.) Let such that . Let . If is differentiable at , and is differentiable at then is differentiable at and . (Here, each of ,, and is a matrix.)
Intuitive Explanation
The single-variable Chain Rule is often explained by pointing out that
.
The first term on the right approaches , and the second term on the right approaches , as approaches . This can be made into a rigorous proof. (But we do have to worry about the possibility that , in which case we would be dividing by .)
This explanation of the chain rule fails in the multi-dimensional case, because in the multi-dimensional case is a vector, as is , and we can't divide by a vector.
However, there's another way to look at it.
Suppose a function is differentiable at , and is "small". Question: How much does change when its input changes from to ? (In other words, what is ?) Answer: approximately . This is true in the multi-dimensional case as well as in the single-variable case.
Well, suppose that (as above) , and is "small", and someone asks you how much changes when its input changes from to . That is the same as asking how much changes when its input changes from to , which is the same as asking how much changes when its input changes from to , where . And what is the answer to this question? The answer is: approximately, .
But what is ? In other words, how much does change when its input changes from to ? Answer: approximately .
Therefore, the amount that changes when its input changes from to is approximately .
We know that is supposed to be a matrix (or number, in the single-variable case) such that is a good approximation to . Thus, it seems that is a good candidate for being the matrix (or number) that is supposed to be.
This can be made into a rigorous proof. The standard proof of the multi-dimensional chain rule can be thought of in this way.
Proof
Here's a proof of the multi-variable Chain Rule. It's kind of a "rigorized" version of the intuitive argument given above.
I'll use the following fact. Assume , and . Then is differentiable at if and only if there exists an by matrix such that the "error" function has the property that approaches as approaches . (In fact, this can be taken as a definition of the statement " is differentiable at .") If such a matrix exists, then it is unique, and it is called . Intuitively, the fact that approaches as approaches just means that is approximated well by .
Okay, here's the proof.
Let and . (Here each of , , and is a positive integer.) Let such that . Let , and suppose that is differentiable at and is differentiable at .
In the intuitive argument, we said that if is "small", then , where . In this proof, we'll fix that statement up and make it rigorous. What we can say is, if , then , where is a function which has the property that .
Now let's work on . In the intuitive argument, we said that . In this proof, we'll make that rigorous by saying , where has the property that .
Putting these pieces together, we find that
, where I have taken that messy error term and called it .
Now, we just need to show that as , in order to prove that is differentiable at and that .
I believe we've hit a point where intuition no longer guides us. In order to finish off the proof, we just need to look at and play around with it a bit. It's not that bad. For the time being, I'll leave the rest of the proof as an exercise for the reader. (Hint: If is an by matrix, then there exists a number such that for all .)
Here I'm going to spell out the details of the rest of this proof.
by the triangle inequality.
Let's call the first term on the right here the "first error term" and the second term on the right the "second error term." If we can show that the "first error term" and the "second error term" each approach as , then we'll be done.
which approaches as . So the "first error term" approaches . That's good. ( is the -norm of the matrix .)
What about the "second error term", ? Is that small? Well, on top we have the norm of with a certain (slightly complicated) input. We know that is supposed to be small, as long as its input is small. In fact, we know more than that. If you take , and divide it by the norm of its input, then that quotient is also supposed to be small, as long as the input of is small. This suggests an idea: divide by the norm of the input of , and look at what we get. But to make up for the fact that we are dividing by the norm of the input of , we will also have to multiply by the norm of the input of .
Idea:
The first term on the right should approach , and the second term on the right hopefully at least remains bounded, as .
This idea seems promising, but there is a problem with it. We might be dividing by . When we divide by the norm of the input of , we might be dividing by . Fortunately, this idea can be fixed.
Let's introduce a function such that is equal to if , and is if . Then for all , and as .
.
Certainly as . Also, since , we know that as . So as , which means that as .
.
This remains bounded as .
We have shown that the "second error term" is a product of one term that approaches and another term that remains bounded as . Therefore, the "second error term" approaches as . So the proof is complete.