Chain Rule

Statement

Basically, the Chain Rule says that if $h(x) = f(g(x))$, then $h'(x)=f'(g(x))g'(x)$.


For example, if $f(x)=\sin{x}$ , $g(x)=x^2$, and $h(x)=f(g(x))=\sin{(x^2)}$, then $h'(x) = \cos{(x^2)}\cdot(2x)$.


Here are some more precise statements for the single-variable and multi-variable cases.


Single-variable Chain Rule:


Let $I \subset \mathbb{R}$ and $J \subset \mathbb{R}$ be open intervals, and suppose $g:I \to J$ and $f:J \to \mathbb{R}$. Let $h:I \to \mathbb{R}$ be defined by $h(x) = f(g(x))$ for all $x \in I$. If $x_0 \in I$, $g$ is differentiable at ${x_0}$, and ${f}$ is differentiable at $g(x_0)$, then ${h}$ is differentiable at ${x_0}$, and ${h'(x_0) = f'(g(x_0))\cdot g'(x_0)}$.
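
To see how the example above fits this statement, one can take $I = J = \mathbb{R}$ (the whole real line is an open interval), $g(x) = x^2$, and $f(x) = \sin{x}$. Both functions are differentiable everywhere, so for every $x_0 \in \mathbb{R}$ the theorem gives $h'(x_0) = \cos{(x_0^2)}\cdot 2x_0$, matching the computation above.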


Multi-dimensional Chain Rule:


Let $g:\mathbb{R}^n \to \mathbb{R}^m$ and $f:\mathbb{R}^m \to \mathbb{R}^p$. (Here each of $n$, $m$, and ${p}$ is a positive integer.) Let ${h}: \mathbb{R}^n \to \mathbb{R}^p$ be defined by $h(x) = f(g(x))$ for all $x \in \mathbb{R}^n$. Let $x_0 \in \mathbb{R}^n$. If $g$ is differentiable at ${x_0}$ and ${f}$ is differentiable at $g(x_0)$, then $h$ is differentiable at ${x_0}$ and $h'(x_0) = f'(g(x_0))\cdot g'(x_0)$. (Here, each of $h'(x_0)$, $f'(g(x_0))$, and $g'(x_0)$ is a matrix.)
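
For concreteness, here is one small worked example of how the matrices fit together (the particular functions are only an illustration, not part of the statement). Take $n = 1$, $m = 2$, $p = 1$, $g(t) = (\cos{t}, \sin{t})$, and $f(x,y) = x^2 + y^2$, so that $h(t) = f(g(t)) = \cos^2{t} + \sin^2{t} = 1$ for all $t$. Then

$f'(g(t)) = \begin{pmatrix} 2\cos{t} & 2\sin{t} \end{pmatrix}, \qquad g'(t) = \begin{pmatrix} -\sin{t} \\ \cos{t} \end{pmatrix},$

and the $1$ by $1$ matrix product is

$h'(t) = f'(g(t))\cdot g'(t) = 2\cos{t}\cdot(-\sin{t}) + 2\sin{t}\cdot\cos{t} = 0,$

which agrees with the fact that $h$ is constant.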


Intuitive Explanation

The single-variable Chain Rule is often explained by pointing out that


$\frac{f(g(x+\Delta x)) - f(g(x))}{\Delta x} = \frac{f(g(x+\Delta x)) - f(g(x))}{g(x+ \Delta x)-g(x)}\cdot \frac{g(x+ \Delta x)-g(x)}{\Delta x}$.


The first term on the right approaches $f'(g(x))$, and the second term on the right approaches $g'(x)$, as $\Delta x$ approaches $0$. This can be made into a rigorous proof. (But we do have to worry about the possibility that $g(x+\Delta x) - g(x)=0$, in which case we would be dividing by $0$.)


This explanation of the chain rule fails in the multi-dimensional case, because in the multi-dimensional case $\Delta x$ is a vector, as is $g(x+\Delta x) - g(x)$, and we can't divide by a vector.


However, there's another way to look at it.


Suppose a function $F$ is differentiable at $x$, and $\Delta x$ is "small". Question: How much does $F$ change when its input changes from $x$ to $x+ \Delta x$? (In other words, what is $F(x+ \Delta x) - F(x)$?) Answer: approximately $F'(x) \cdot \Delta x$. This is true in the multi-dimensional case as well as in the single-variable case.
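
As a quick single-variable sanity check (just an illustration), take $F(x) = x^2$, $x = 3$, and $\Delta x = 0.1$. Then $F(x + \Delta x) - F(x) = 3.1^2 - 3^2 = 0.61$, while $F'(x)\cdot \Delta x = 6 \cdot 0.1 = 0.6$; the approximation is off only by $(\Delta x)^2 = 0.01$, which is tiny compared to $\Delta x$ itself.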


Well, suppose that (as above) $h(x) = f(g(x))$, and $\Delta x$ is "small", and someone asks you how much $h$ changes when its input changes from $x$ to $x+ \Delta x$. That is the same as asking how much $f$ changes when its input changes from $g(x)$ to $g(x+ \Delta x)$, which is the same as asking how much $f$ changes when its input changes from $g(x)$ to $g(x) + \Delta g$, where $\Delta g = g(x+ \Delta x) - g(x)$. And what is the answer to this question? The answer is: approximately, $f'(g(x)) \cdot \Delta g$.


But what is $\Delta g$ ? In other words, how much does $g$ change when its input changes from $x$ to $x+ \Delta x$? Answer: approximately $g'(x) \cdot \Delta x$.


Therefore, the amount that $h$ changes when its input changes from $x$ to $x+ \Delta x$ is approximately ${f'(g(x)) \cdot g'(x) \cdot \Delta x}$.
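
To make this concrete, retrace the $h(x) = \sin{(x^2)}$ example from the Statement section: $\Delta g \approx 2x \cdot \Delta x$, and then $\Delta h \approx \cos{(x^2)}\cdot \Delta g \approx \cos{(x^2)}\cdot 2x \cdot \Delta x$; the coefficient of $\Delta x$ here is exactly the derivative $\cos{(x^2)}\cdot(2x)$ computed in the Statement section.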


We know that $h'(x)$ is supposed to be a matrix (or number, in the single-variable case) such that $h'(x) \cdot \Delta x$ is a good approximation to $h(x+ \Delta x) - h(x)$. Thus, it seems that $f'(g(x)) \cdot g'(x)$ is a good candidate for being the matrix (or number) that $h'(x)$ is supposed to be.


This can be made into a rigorous proof. The standard proof of the multi-dimensional chain rule can be thought of in this way.


Proof

Here's a proof of the multi-variable Chain Rule. It's kind of a "rigorized" version of the intuitive argument given above.


I'll use the following fact. Assume $F: \mathbb{R}^n \to \mathbb{R}^m$, and $x \in \mathbb{R}^n$. Then $F$ is differentiable at ${x}$ if and only if there exists an $m$ by $n$ matrix $M$ such that the "error" function ${E_F(\Delta x)= F(x+\Delta x)-F(x)-M\cdot \Delta x}$ has the property that $\frac{|E_F(\Delta x)|}{|\Delta x|}$ approaches $0$ as $\Delta x$ approaches $0$. (In fact, this can be taken as a definition of the statement "$F$ is differentiable at ${x}$.") If such a matrix $M$ exists, then it is unique, and it is called $F'(x)$. Intuitively, the fact that $\frac{|E_F(\Delta x)|}{|\Delta x|}$ approaches $0$ as $\Delta x$ approaches $0$ just means that $F(x + \Delta x)-F(x)$ is approximated well by $M \cdot \Delta x$.
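
As a one-dimensional illustration of this fact (not needed for the proof itself), take $F(x) = x^2$ with $n = m = 1$ and $M = 2x$. Then

$E_F(\Delta x) = (x+\Delta x)^2 - x^2 - 2x\cdot\Delta x = (\Delta x)^2,$

and $\frac{|E_F(\Delta x)|}{|\Delta x|} = |\Delta x|$, which indeed approaches $0$ as $\Delta x$ approaches $0$; so this definition recovers $F'(x) = 2x$, as expected.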



Okay, here's the proof.


Let $g:\mathbb{R}^n \to \mathbb{R}^m$ and $f:\mathbb{R}^m \to \mathbb{R}^p$. (Here each of $n$, $m$, and ${p}$ is a positive integer.) Let ${h}: \mathbb{R}^n \to \mathbb{R}^p$ be defined by $h(x) = f(g(x))$ for all $x \in \mathbb{R}^n$. Let $x_0 \in \mathbb{R}^n$, and suppose that $g$ is differentiable at ${x_0}$ and $f$ is differentiable at $g(x_0)$.


In the intuitive argument, we said that if $\Delta x$ is "small", then $\Delta h = f(g(x_0+\Delta x))-f(g(x_0)) \approx f'(g(x_0))\cdot \Delta g$, where $\Delta g = g(x_0+\Delta x)-g(x_0)$. In this proof, we'll fix that statement up and make it rigorous. What we can say is, if $\Delta x \in \mathbb{R}^n$, then $\Delta h = f(g(x_0)+\Delta g)-f(g(x_0)) = f'(g(x_0))\cdot \Delta g + E_f(\Delta g)$, where $E_f:\mathbb{R}^m \to \mathbb{R}^p$ is a function which has the property that $\lim_{\Delta g \to 0} \frac{|E_f(\Delta g)|}{|\Delta g|}=0$.


Now let's work on $\Delta g$. In the intuitive argument, we said that $\Delta g \approx g'(x_0)\cdot \Delta x$. In this proof, we'll make that rigorous by saying $\Delta g = g'(x_0)\cdot \Delta x + E_g(\Delta x)$, where $E_g:\mathbb{R}^n \to \mathbb{R}^m$ has the property that $\lim_{\Delta x \to 0} \frac{|E_g(\Delta x)|}{|\Delta x|} = 0$.


Putting these pieces together, we find that

$\Delta h = f'(g(x_0))\Delta g + E_f(\Delta g)$

$= f'(g(x_0))\left(g'(x_0)\Delta x + E_g(\Delta x)\right) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$

$=f'(g(x_0))g'(x_0)\Delta x + f'(g(x_0))E_g(\Delta x) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$

$= f'(g(x_0))g'(x_0)\Delta x + E_h(\Delta x)$,

where I have taken that messy error term and called it $E_h(\Delta x)$; explicitly, $E_h(\Delta x) = f'(g(x_0))E_g(\Delta x) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$.


Now, we just need to show that $\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0$ as $\Delta x \to 0$, in order to prove that $h$ is differentiable at ${x_0}$ and that $h'(x_0) = f'(g(x_0))g'(x_0)$.


I believe we've hit a point where intuition no longer guides us. In order to finish off the proof, we just need to look at $E_h(\Delta x)$ and play around with it a bit. It's not that bad. For the time being, I'll leave the rest of the proof as an exercise for the reader. (Hint: If $A$ is an $m$ by $n$ matrix, then there exists a number $k > 0$ such that $|Ax| \le k|x|$ for all $x \in \mathbb{R}^n$.)
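
For readers who want something to check their work against, here is one way the estimate can be carried out using the hint (a sketch of the standard argument, not the only way to finish). Recall that $E_h(\Delta x) = f'(g(x_0))E_g(\Delta x) + E_f(\Delta g)$, where $\Delta g = g'(x_0)\Delta x + E_g(\Delta x)$. By the hint, choose $k_1, k_2 > 0$ with $|f'(g(x_0))v| \le k_1|v|$ for all $v \in \mathbb{R}^m$ and $|g'(x_0)w| \le k_2|w|$ for all $w \in \mathbb{R}^n$. For the first term,

$\frac{|f'(g(x_0))E_g(\Delta x)|}{|\Delta x|} \le k_1 \cdot \frac{|E_g(\Delta x)|}{|\Delta x|} \to 0 \text{ as } \Delta x \to 0.$

For the second term, once $\Delta x$ is small enough that $|E_g(\Delta x)| \le |\Delta x|$, we have $|\Delta g| \le k_2|\Delta x| + |E_g(\Delta x)| \le (k_2+1)|\Delta x|$, so $\Delta g \to 0$ as $\Delta x \to 0$. Whenever $\Delta g \ne 0$,

$\frac{|E_f(\Delta g)|}{|\Delta x|} = \frac{|E_f(\Delta g)|}{|\Delta g|}\cdot\frac{|\Delta g|}{|\Delta x|} \le (k_2+1)\cdot\frac{|E_f(\Delta g)|}{|\Delta g|} \to 0,$

and when $\Delta g = 0$ the term vanishes because $E_f(0) = 0$. Adding the two estimates shows that $\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0$ as $\Delta x \to 0$, which is what we needed.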


See also

Calculus