Difference between revisions of "Chain Rule"

m (punctuation)
(this is an encyclopedia article, not a personal mathematics paper.)
Line 11: Line 11:
  
  
Single variable Chain Rule:
+
===Single variable Chain Rule===
  
  
Line 17: Line 17:
  
  
Multi-dimensional Chain Rule:
+
===Multi-dimensional Chain Rule===
  
  
Line 36: Line 36:
  
  
However, there's another way to look at it.
+
There's another way to look at it:
  
  
Line 42: Line 42:
  
  
Well, suppose that (as above) <math>h(x) = f(g(x))</math>, and <math>\Delta x</math> is "small", and someone asks you how much <math>h</math> changes when its input changes from <math>x</math> to <math>x+ \Delta x</math>.  That is the ''same'' as asking how much <math>f</math> changes when its input changes from <math>g(x)</math> to <math>g(x+ \Delta x)</math>,  which is the same as asking how much <math>f</math> changes when its input changes from <math>g(x)</math> to <math>g(x) + \Delta g</math>, where <math>\Delta g = g(x+ \Delta x) - g(x)</math>.  And what is the answer to this question?  The answer is: approximately, <math>f'(g(x)) \cdot \Delta g</math>.
+
Suppose that (as above) <math>h(x) = f(g(x))</math>, and <math>\Delta x</math> is "small", and someone asks you how much <math>h</math> changes when its input changes from <math>x</math> to <math>x+ \Delta x</math>.  That is the ''same'' as asking how much <math>f</math> changes when its input changes from <math>g(x)</math> to <math>g(x+ \Delta x)</math>,  which is the same as asking how much <math>f</math> changes when its input changes from <math>g(x)</math> to <math>g(x) + \Delta g</math>, where <math>\Delta g = g(x+ \Delta x) - g(x)</math>.  And what is the answer to this question?  The answer is: approximately, <math>f'(g(x)) \cdot \Delta g</math>.
  
  
But what is <math>\Delta g</math> ?  In other words, how much does <math>g</math> change when its input changes from <math>x</math> to <math>x+ \Delta x</math>?  Answer:  approximately <math>g'(x) \cdot \Delta x</math>.
+
We must determine how much does <math>g</math> change when its input changes from <math>x</math> to <math>x+ \Delta x</math>?  Answer:  approximately <math>g'(x) \cdot \Delta x</math>.
  
  
Line 60: Line 60:
  
  
Here's a proof of the multi-variable Chain Rule.  It's kind of a "rigorized" version of the intuitive argument given above.
+
The following is a proof of the multi-variable Chain Rule.  It's a "rigorized" version of the intuitive argument given above.
  
  
  
I'll use the following fact. Assume <math>F: \mathbb{R}^n \to \mathbb{R}^m</math>, and <math>x \in \mathbb{R}^n</math>.  Then <math>F</math> is differentiable at <math>{x}</math> if and only if there exists an <math>m</math> by <math>n</math> matrix <math>M</math> such that the "error" function <math>{E_F(\Delta x)= F(x+\Delta x)-F(x)-M\cdot \Delta x}</math> has the property that <math>\frac{|E_F(\Delta x)|}{|\Delta x|}</math> approaches <math>0</math> as <math>\Delta x</math> approaches <math>0</math>.  (In fact, this can be taken as a definition of the statement "<math>F</math> is differentiable at <math>{x}</math>.")  If such a matrix <math>M</math> exists, then it is unique, and it is called <math>F'(x)</math>.  Intuitively, the fact that <math>\frac{|E_F(\Delta x)|}{|\Delta x|}</math> approaches <math>0</math> as <math>\Delta x</math> approaches <math>0</math> just means that <math>F(x + \Delta x)-F(x)</math> is approximated well by <math>M \cdot \Delta x</math>.
+
This proof uses the following fact: Assume <math>F: \mathbb{R}^n \to \mathbb{R}^m</math>, and <math>x \in \mathbb{R}^n</math>.  Then <math>F</math> is differentiable at <math>{x}</math> if and only if there exists an <math>m</math> by <math>n</math> matrix <math>M</math> such that the "error" function <math>{E_F(\Delta x)= F(x+\Delta x)-F(x)-M\cdot \Delta x}</math> has the property that <math>\frac{|E_F(\Delta x)|}{|\Delta x|}</math> approaches <math>0</math> as <math>\Delta x</math> approaches <math>0</math>.  (In fact, this can be taken as a definition of the statement "<math>F</math> is differentiable at <math>{x}</math>.")  If such a matrix <math>M</math> exists, then it is unique, and it is called <math>F'(x)</math>.  Intuitively, the fact that <math>\frac{|E_F(\Delta x)|}{|\Delta x|}</math> approaches <math>0</math> as <math>\Delta x</math> approaches <math>0</math> just means that <math>F(x + \Delta x)-F(x)</math> is approximated well by <math>M \cdot \Delta x</math>.
 
 
 
 
 
 
 
 
Okay, here's the proof.
 
  
  
Line 75: Line 70:
  
  
In the intuitive argument, we said that if <math>\Delta x</math> is "small", then <math>\Delta h = f(g(x_0+\Delta x))-f(g(x_0)) \approx f'(g(x_0))\cdot \Delta g</math>, where <math>\Delta g = g(x_0+\Delta x)-g(x_0)</math>.  In this proof, we'll fix that statement up and make it rigorous.  What we can say is, if <math>\Delta x \in \mathbb{R}^n</math>, then <math>\Delta h = f(g(x_0)+\Delta g)-f(g(x_0)) = f'(g(x_0))\cdot \Delta g + E_f(\Delta g)</math>, where <math>E_f:\mathbb{R}^m \to \mathbb{R}^p</math> is a function which has the property that <math>\lim_{\Delta g \to 0} \frac{|E_f(\Delta g)|}{|\Delta g|}=0</math>.
+
In the intuitive argument, we stated that if <math>\Delta x</math> is "small", then <math>\Delta h = f(g(x_0+\Delta x))-f(g(x_0)) \approx f'(g(x_0))\cdot \Delta g</math>, where <math>\Delta g = g(x_0+\Delta x)-g(x_0)</math>.  In this proof, we'll fix that statement up and make it rigorous.  What we can say is, if <math>\Delta x \in \mathbb{R}^n</math>, then <math>\Delta h = f(g(x_0)+\Delta g)-f(g(x_0)) = f'(g(x_0))\cdot \Delta g + E_f(\Delta g)</math>, where <math>E_f:\mathbb{R}^m \to \mathbb{R}^p</math> is a function which has the property that <math>\lim_{\Delta g \to 0} \frac{|E_f(\Delta g)|}{|\Delta g|}=0</math>.
  
 +
In the intuitive argument, we said that <math>\Delta g \approx g'(x_0)\cdot \Delta x</math>.  In this proof, we'll make that rigorous by saying <math>\Delta g = g'(x_0)\cdot \Delta x + E_g(\Delta x)</math>, where <math>E_g:\mathbb{R}^n \to \mathbb{R}^m</math> has the property that <math>\lim_{\Delta x \to 0} \frac{|E_g(\Delta x)|}{\Delta x} = 0</math>.
  
Now let's work on <math>\Delta g</math>.  In the intuitive argument, we said that <math>\Delta g \approx g'(x_0)\cdot \Delta x</math>.  In this proof, we'll make that rigorous by saying <math>\Delta g = g'(x_0)\cdot \Delta x + E_g(\Delta x)</math>, where <math>E_g:\mathbb{R}^n \to \mathbb{R}^m</math> has the property that <math>\lim_{\Delta x \to 0} \frac{|E_g(\Delta x)|}{\Delta x} = 0</math>.
 
  
 
+
Putting these together, we find that
Putting these pieces together, we find that
 
 
<math>\Delta h = f'(g(x_0))\Delta g + E_f(\Delta g)</math>  
 
<math>\Delta h = f'(g(x_0))\Delta g + E_f(\Delta g)</math>  
 
<math>= f'(g(x_0))\left(g'(x_0)\Delta x + E_g(\Delta x)\right) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right) </math>
 
<math>= f'(g(x_0))\left(g'(x_0)\Delta x + E_g(\Delta x)\right) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right) </math>
Line 88: Line 82:
  
  
Now, we just need to show that <math>\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0</math> as <math>\Delta x \to 0</math>, in order to prove that <math>h</math> is differentiable at <math>{x_0}</math> and that <math>h'(x_0) = f'(g(x_0))g'(x_0)</math>.
+
Now, we need to show that <math>\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0</math> as <math>\Delta x \to 0</math>, in order to prove that <math>h</math> is differentiable at <math>{x_0}</math> and that <math>h'(x_0) = f'(g(x_0))g'(x_0)</math>.
  
 
+
In order to finish off the proof, it's needed to look at <math>E_h(\Delta x)</math> and "play around with it", so to speak. The conclusion can be reached by the following fact: If <math>A</math> is an <math>m</math> by <math>n</math> matrix, then there exists a number <math>K > 0</math> such that <math>|Ax| \le K|x|</math> for all <math>x \in \mathbb{R}^n</math>.
I believe we've hit a point where intuition no longer guides us.  In order to finish off the proof, we just need to look at <math>E_h(\Delta x)</math> and play around with it a bit. It's not that bad.  For the time being, I'll leave the rest of the proof as an exercise for the reader.  (Hint: If <math>A</math> is an <math>m</math> by <math>n</math> matrix, then there exists a number <math>K > 0</math> such that <math>|Ax| \le K|x|</math> for all <math>x \in \mathbb{R}^n</math>.)
 
 
 
 
 
Here I'm going to spell out the details of the rest of this proof.
 
  
  
Line 100: Line 90:
  
  
Let's call the first term on the right here the "first error term" and the second term on the right the "second error term."  If we can show that the "first error term" and the "second error term" each approach <math>0</math> as <math>\Delta x \to 0</math>, then we'll be done.
+
We'll call the first term on the right here the "first error term" and the second term on the right the "second error term."  If we can show that the "first error term" and the "second error term" each approach <math>0</math> as <math>\Delta x \to 0</math>, then we'll be done.
  
  
Line 109: Line 99:
  
  
What about the "second error term", <math>\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|\Delta x|} </math> ?  Is that small?  Well, on top we have the norm of <math>E_f</math> with a certain (slightly complicated) input.  We know that <math>E_f</math> is supposed to be small, as long as its input is small.  In fact, we know more than that.  If you take <math>E_f</math>, and divide it by the norm of its input, then that quotient is also supposed to be small, as long as the input of <math>E_f</math> is small.  This suggests an idea: divide by the norm of the input of <math>E_f</math>, and look at what we get.  But to make up for the fact that we are dividing by the norm of the input of <math>E_f</math>, we will also have to multiply by the norm of the input of <math>E_f</math>.
+
Consider the "second error term", <math>\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|\Delta x|} </math>. On top we have the norm of <math>E_f</math> with a certain (slightly complicated) input.  We know that <math>E_f</math> is supposed to be small, as long as its input is small.  In fact, we know more than that.  If you take <math>E_f</math>, and divide it by the norm of its input, then that quotient is also supposed to be small, as long as the input of <math>E_f</math> is small.  This suggests an idea: divide by the norm of the input of <math>E_f</math>, and look at what we get.  But to make up for the fact that we are dividing by the norm of the input of <math>E_f</math>, we will also have to multiply by the norm of the input of <math>E_f</math>.
 
 
 
 
Idea:
 
 
 
  
 
<math>\frac{ |E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)| }{|\Delta x|} =\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|g'(x_0)\Delta x + E_g(\Delta x)| } \cdot \frac{|g'(x_0)\Delta x + E_g(\Delta x)|}{|\Delta x|}</math>
 
<math>\frac{ |E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)| }{|\Delta x|} =\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|g'(x_0)\Delta x + E_g(\Delta x)| } \cdot \frac{|g'(x_0)\Delta x + E_g(\Delta x)|}{|\Delta x|}</math>
Line 121: Line 107:
  
  
This idea seems promising, but there is a problem with it.  We might be dividing by <math>0</math>.  When we divide by the norm of the input of <math>E_f</math>, we might be dividing by <math>0</math>. Fortunately, this idea can be fixed.
+
This idea is promising, but there is a problem with it.  When we divide by the norm of the input of <math>E_f</math>, we may be dividing by <math>0</math>. The following argument can resolve this anomaly.
  
  
Let's introduce a function <math>e_f</math> such that <math>e_f(z)</math> is equal to <math>\frac{|E_f(z)|}{|z|}</math> if <math>z \neq 0</math>, and <math>e_f(z)</math> is <math>0</math> if <math>z = 0</math>.  Then <math>{|E_f(z)|=e_f(z)\cdot|z|}</math> for all <math>z</math>, and <math>e_f(z) \to 0</math> as <math>z \to 0</math>.
+
We introduce a function <math>e_f</math> such that <math>e_f(z)</math> is equal to <math>\frac{|E_f(z)|}{|z|}</math> if <math>z \neq 0</math>, and <math>e_f(z)</math> is <math>0</math> if <math>z = 0</math>.  Then <math>{|E_f(z)|=e_f(z)\cdot|z|}</math> for all <math>z</math>, and <math>e_f(z) \to 0</math> as <math>z \to 0</math>.
  
  
Line 145: Line 131:
  
  
We have shown that the "second error term" is a product of one term that approaches <math>0</math> and another term that remains bounded as <math>\Delta x \to 0</math>.  Therefore, the "second error term" approaches <math>0</math> as <math>\Delta x \to 0</math>.  So the proof is complete.
+
We have shown that the "second error term" is a product of one term that approaches <math>0</math> and another term that remains bounded as <math>\Delta x \to 0</math>.  Therefore, the "second error term" approaches <math>0</math> as <math>\Delta x \to 0</math>.   
  
 
== See also ==
 
== See also ==

Revision as of 20:27, 15 November 2007

The Chain Rule is an essential theorem of calculus.

Theorem

The theorem states that if $h(x) = f(g(x))$, then $h'(x)=f'(g(x))\cdot g'(x)$ wherever those expressions make sense.


For example, if $f(x)=\sin{x}$ , $g(x)=x^2$, and $h(x)=f(g(x))=\sin{(x^2)}$, then $h'(x) = \cos{(x^2)}\cdot(2x)$.
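The example above can be checked numerically. The following is a minimal sketch, not part of the original article: it compares the chain rule formula $\cos{(x^2)}\cdot(2x)$ with a symmetric difference quotient of $h$. The function names, the sample point `x0`, and the step size are all illustrative assumptions.

```python
import math


def h(x):
    # h(x) = f(g(x)) with f = sin and g(x) = x^2
    return math.sin(x * x)


def h_prime_chain_rule(x):
    # f'(g(x)) * g'(x) = cos(x^2) * 2x
    return math.cos(x * x) * (2.0 * x)


def central_difference(func, x, step=1e-6):
    # Symmetric difference quotient; the step size is an arbitrary choice.
    return (func(x + step) - func(x - step)) / (2.0 * step)


x0 = 1.3  # arbitrary sample point
numeric = central_difference(h, x0)
exact = h_prime_chain_rule(x0)
print(abs(numeric - exact))  # should be very small
```

Agreement at a single point is only a sanity check, not a proof; the rigorous argument is given in the Proof section.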


Here are some more precise statements for the single-variable and multi-variable cases.


Single variable Chain Rule

Let each of $I \subset \mathbb{R}$ and $J \subset \mathbb{R}$ be an open interval, and suppose $g:I \to J$ and $f:J \to \mathbb{R}$. Let $h:I \to \mathbb{R}$ be defined by $h(x) = f(g(x))$ for all $x \in I$. If $x_0 \in I$, $g$ is differentiable at ${x_0}$, and ${f}$ is differentiable at $g(x_0)$, then ${h}$ is differentiable at ${x_0}$, and ${h'(x_0) = f'(g(x_0))\cdot g'(x_0)}$.


Multi-dimensional Chain Rule

Let $g:\mathbb{R}^n \to \mathbb{R}^m$ and $f:\mathbb{R}^m \to \mathbb{R}^p$. (Here each of $n$, $m$, and ${p}$ is a positive integer.) Let ${h}: \mathbb{R}^n \to \mathbb{R}^p$ be defined by $h(x) = f(g(x))$ for all $x \in \mathbb{R}^n$. Let $x_0 \in \mathbb{R}^n$. If $g$ is differentiable at ${x_0}$, and ${f}$ is differentiable at $g(x_0)$, then $h$ is differentiable at ${x_0}$ and $h'(x_0) = f'(g(x_0))\cdot g'(x_0)$. (Here, each of $h'(x_0)$, $f'(g(x_0))$, and $g'(x_0)$ is a matrix, and the product is a matrix product.)
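As an illustration of the matrix form, here is a small sketch with $n = m = 2$ and $p = 1$. The particular functions $g(x, y) = (xy,\ x + y^2)$ and $f(u, v) = \sin u + uv$ are assumptions chosen for this example, as are all helper names. The code multiplies the hand-computed Jacobian matrices and compares the product with a finite-difference estimate of $h'(x_0)$.

```python
import math


def g(x):
    # g: R^2 -> R^2, chosen for illustration only
    return [x[0] * x[1], x[0] + x[1] ** 2]


def g_jacobian(x):
    # 2 x 2 matrix of partial derivatives of g
    return [[x[1], x[0]], [1.0, 2.0 * x[1]]]


def f(u):
    # f: R^2 -> R^1, chosen for illustration only
    return [math.sin(u[0]) + u[0] * u[1]]


def f_jacobian(u):
    # 1 x 2 matrix of partial derivatives of f
    return [[math.cos(u[0]) + u[1], u[0]]]


def h(x):
    return f(g(x))


def mat_mul(a, b):
    # Plain matrix product of nested lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def numeric_jacobian(func, x, step=1e-6):
    # Central differences, one input coordinate at a time
    columns = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += step
        minus[j] -= step
        columns.append([(p - m) / (2.0 * step)
                        for p, m in zip(func(plus), func(minus))])
    # columns[j][i] holds d func_i / d x_j, so transpose
    return [[columns[j][i] for j in range(len(x))]
            for i in range(len(columns[0]))]


x0 = [0.7, -0.4]  # arbitrary sample point
chain = mat_mul(f_jacobian(g(x0)), g_jacobian(x0))  # f'(g(x0)) * g'(x0)
numeric = numeric_jacobian(h, x0)
max_gap = max(abs(c - n) for rc, rn in zip(chain, numeric)
              for c, n in zip(rc, rn))
print(max_gap)  # should be very small
```

Note the order of the factors: $f'(g(x_0))$ is $p \times m$ and $g'(x_0)$ is $m \times n$, so the product is $p \times n$, matching the shape of $h'(x_0)$.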

Intuitive Explanation

The single-variable Chain Rule is often explained by pointing out that


$\frac{f(g(x+\Delta x)) - f(g(x))}{\Delta x} = \frac{f(g(x+\Delta x)) - f(g(x))}{g(x+ \Delta x)-g(x)}\cdot \frac{g(x+ \Delta x)-g(x)}{\Delta x}$.


The first term on the right approaches $f'(g(x))$, and the second term on the right approaches $g'(x)$, as $\Delta x$ approaches $0$. This can be made into a rigorous proof. (But we do have to worry about the possibility that $g(x+\Delta x) - g(x)=0$, in which case we would be dividing by $0$.)


This explanation of the chain rule fails in the multi-dimensional case, because in the multi-dimensional case $\Delta x$ is a vector, as is $g(x+\Delta x) - g(x)$, and we can't divide by a vector.


There's another way to look at it:


Suppose a function $F$ is differentiable at $x$, and $\Delta x$ is "small". Question: How much does $F$ change when its input changes from $x$ to $x+ \Delta x$? (In other words, what is $F(x+ \Delta x) - F(x)$?) Answer: approximately $F'(x) \cdot \Delta x$. This is true in the multi-dimensional case as well as in the single-variable case.


Suppose that (as above) $h(x) = f(g(x))$, and $\Delta x$ is "small", and someone asks you how much $h$ changes when its input changes from $x$ to $x+ \Delta x$. That is the same as asking how much $f$ changes when its input changes from $g(x)$ to $g(x+ \Delta x)$, which is the same as asking how much $f$ changes when its input changes from $g(x)$ to $g(x) + \Delta g$, where $\Delta g = g(x+ \Delta x) - g(x)$. And what is the answer to this question? The answer is: approximately, $f'(g(x)) \cdot \Delta g$.


We must now determine how much $g$ changes when its input changes from $x$ to $x+ \Delta x$. Answer: approximately $g'(x) \cdot \Delta x$.


Therefore, the amount that $h$ changes when its input changes from $x$ to $x+ \Delta x$ is approximately ${f'(g(x)) \cdot g'(x) \cdot \Delta x}$.


We know that $h'(x)$ is supposed to be a matrix (or number, in the single-variable case) such that $h'(x) \cdot \Delta x$ is a good approximation to $h(x+ \Delta x) - h(x)$. Thus, it seems that $f'(g(x)) \cdot g'(x)$ is a good candidate for being the matrix (or number) that $h'(x)$ is supposed to be.


This can be made into a rigorous proof. The standard proof of the multi-dimensional chain rule can be thought of in this way.


Proof

The following is a proof of the multi-variable Chain Rule. It makes the intuitive argument given above rigorous.


This proof uses the following fact: Assume $F: \mathbb{R}^n \to \mathbb{R}^m$, and $x \in \mathbb{R}^n$. Then $F$ is differentiable at ${x}$ if and only if there exists an $m$ by $n$ matrix $M$ such that the "error" function ${E_F(\Delta x)= F(x+\Delta x)-F(x)-M\cdot \Delta x}$ has the property that $\frac{|E_F(\Delta x)|}{|\Delta x|}$ approaches $0$ as $\Delta x$ approaches $0$. (In fact, this can be taken as a definition of the statement "$F$ is differentiable at ${x}$.") If such a matrix $M$ exists, then it is unique, and it is called $F'(x)$. Intuitively, the fact that $\frac{|E_F(\Delta x)|}{|\Delta x|}$ approaches $0$ as $\Delta x$ approaches $0$ just means that $F(x + \Delta x)-F(x)$ is approximated well by $M \cdot \Delta x$.
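To make the "error" function concrete, here is a small sketch that computes $\frac{|E_F(\Delta x)|}{|\Delta x|}$ for shrinking $\Delta x$. The function $F(x, y) = (x^2 - y,\ xy)$, the sample point, and the direction of $\Delta x$ are assumptions chosen for illustration; they do not come from the article.

```python
import math


def F(p):
    # F: R^2 -> R^2, chosen for illustration only
    x, y = p
    return [x * x - y, x * y]


def F_jacobian(p):
    # Candidate matrix M = F'(p)
    x, y = p
    return [[2.0 * x, -1.0], [y, x]]


def norm(v):
    # Euclidean norm
    return math.sqrt(sum(c * c for c in v))


def error_ratio(p, delta):
    # |E_F(delta)| / |delta|, where E_F(delta) = F(p + delta) - F(p) - M * delta
    shifted = [pi + di for pi, di in zip(p, delta)]
    M = F_jacobian(p)
    linear = [sum(M[i][j] * delta[j] for j in range(2)) for i in range(2)]
    error = [a - b - c for a, b, c in zip(F(shifted), F(p), linear)]
    return norm(error) / norm(delta)


point = [1.5, -0.5]  # arbitrary sample point
ratios = [error_ratio(point, [t, -2.0 * t]) for t in (1e-1, 1e-2, 1e-3)]
print(ratios)  # each ratio should be roughly ten times smaller than the last
```

For this particular $F$, writing $\Delta x = (a, b)$ gives $E_F(\Delta x) = (a^2,\ ab)$, so the ratio equals $|a|$ and shrinks in proportion to $\Delta x$.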


Let $g:\mathbb{R}^n \to \mathbb{R}^m$ and $f:\mathbb{R}^m \to \mathbb{R}^p$. (Here each of $n$, $m$, and ${p}$ is a positive integer.) Let ${h}: \mathbb{R}^n \to \mathbb{R}^p$ be defined by $h(x) = f(g(x))$ for all $x \in \mathbb{R}^n$. Let $x_0 \in \mathbb{R}^n$, and suppose that $g$ is differentiable at ${x_0}$ and $f$ is differentiable at $g(x_0)$.


In the intuitive argument, we stated that if $\Delta x$ is "small", then $\Delta h = f(g(x_0+\Delta x))-f(g(x_0)) \approx f'(g(x_0))\cdot \Delta g$, where $\Delta g = g(x_0+\Delta x)-g(x_0)$. In this proof, we make that statement rigorous. Because $f$ is differentiable at $g(x_0)$, for any $\Delta x \in \mathbb{R}^n$ we have $\Delta h = f(g(x_0)+\Delta g)-f(g(x_0)) = f'(g(x_0))\cdot \Delta g + E_f(\Delta g)$, where $E_f:\mathbb{R}^m \to \mathbb{R}^p$ is a function which has the property that $\lim_{\Delta g \to 0} \frac{|E_f(\Delta g)|}{|\Delta g|}=0$.

In the intuitive argument, we said that $\Delta g \approx g'(x_0)\cdot \Delta x$. In this proof, we make that rigorous by writing $\Delta g = g'(x_0)\cdot \Delta x + E_g(\Delta x)$, where $E_g:\mathbb{R}^n \to \mathbb{R}^m$ has the property that $\lim_{\Delta x \to 0} \frac{|E_g(\Delta x)|}{|\Delta x|} = 0$.


Putting these together, we find that

$\Delta h = f'(g(x_0))\Delta g + E_f(\Delta g)$

$= f'(g(x_0))\left(g'(x_0)\Delta x + E_g(\Delta x)\right) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$

$=f'(g(x_0))g'(x_0)\Delta x + f'(g(x_0))E_g(\Delta x) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$

$= f'(g(x_0))g'(x_0)\Delta x + E_h(\Delta x)$,

where $E_h(\Delta x) = f'(g(x_0))E_g(\Delta x) + E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)$ collects the error terms.


Now, we need to show that $\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0$ as $\Delta x \to 0$, in order to prove that $h$ is differentiable at ${x_0}$ and that $h'(x_0) = f'(g(x_0))g'(x_0)$.

To finish the proof, we bound $E_h(\Delta x)$ directly. The argument relies on the following fact: If $A$ is an $m$ by $n$ matrix, then there exists a number $K > 0$ such that $|Ax| \le K|x|$ for all $x \in \mathbb{R}^n$. Below, the matrix $2$-norm $\Vert A \Vert_2$ plays the role of this bound.
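The bound $|Ax| \le K|x|$ can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the proof: it takes $K$ to be the Frobenius norm of $A$ (the square root of the sum of the squared entries), which is a valid choice by the Cauchy-Schwarz inequality and is easier to compute by hand than the $2$-norm used in the proof. The matrix `A` and the random test vectors are arbitrary.

```python
import math
import random


def mat_vec(a, x):
    # Matrix-vector product for nested lists
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]


def norm(v):
    # Euclidean norm
    return math.sqrt(sum(c * c for c in v))


def frobenius(a):
    # Frobenius norm: square root of the sum of squared entries
    return math.sqrt(sum(c * c for row in a for c in row))


A = [[2.0, -1.0, 0.5],
     [0.0, 3.0, 1.0]]  # arbitrary 2 x 3 example matrix
K = frobenius(A)

rng = random.Random(0)  # fixed seed so the run is repeatable
worst = 0.0
for _ in range(1000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    if norm(x) == 0.0:
        continue
    worst = max(worst, norm(mat_vec(A, x)) / norm(x))

print(worst <= K)  # the largest observed ratio |Ax| / |x| never exceeds K
```

Random sampling only probes the inequality; the Cauchy-Schwarz argument is what guarantees it for every $x$.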


$\frac{|E_h(\Delta x)|}{|\Delta x|} \leq \frac{|f'(g(x_0))E_g(\Delta x)|}{|\Delta x|} + \frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|\Delta x|}$ by the triangle inequality.


We'll call the first term on the right here the "first error term" and the second term on the right the "second error term." If we can show that the "first error term" and the "second error term" each approach $0$ as $\Delta x \to 0$, then we'll be done.


$\frac{|f'(g(x_0))E_g(\Delta x)|}{|\Delta x|} \leq \frac{ \Vert f'(g(x_0)) \Vert_2 |E_g(\Delta x)|}{|\Delta x|} = \Vert f'(g(x_0)) \Vert_2 \frac{|E_g(\Delta x)|}{|\Delta x|}$ which approaches $0$ as $\Delta x \to 0$. So the "first error term" approaches $0$. ($\Vert f'(g(x_0)) \Vert_2$ is the $2$-norm of the matrix $f'(g(x_0))$.)



Consider the "second error term", $\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|\Delta x|}$. On top we have the norm of $E_f$ with a certain (slightly complicated) input. We know that $E_f$ is supposed to be small, as long as its input is small. In fact, we know more than that. If you take $E_f$, and divide it by the norm of its input, then that quotient is also supposed to be small, as long as the input of $E_f$ is small. This suggests an idea: divide by the norm of the input of $E_f$, and look at what we get. But to make up for the fact that we are dividing by the norm of the input of $E_f$, we will also have to multiply by the norm of the input of $E_f$.

$\frac{ |E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)| }{|\Delta x|} =\frac{|E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)|}{|g'(x_0)\Delta x + E_g(\Delta x)| } \cdot \frac{|g'(x_0)\Delta x + E_g(\Delta x)|}{|\Delta x|}$


The first term on the right should approach $0$, and the second term on the right hopefully at least remains bounded, as $\Delta x \to 0$.


This idea is promising, but there is a problem with it. When we divide by the norm of the input of $E_f$, we may be dividing by $0$. The following argument avoids this problem.


We introduce a function $e_f$ such that $e_f(z)$ is equal to $\frac{|E_f(z)|}{|z|}$ if $z \neq 0$, and $e_f(z)$ is $0$ if $z = 0$. Then ${|E_f(z)|=e_f(z)\cdot|z|}$ for all $z$, and $e_f(z) \to 0$ as $z \to 0$.


$\frac{ |E_f \left( g'(x_0)\Delta x + E_g(\Delta x) \right)| }{|\Delta x|} = \frac{ e_f(g'(x_0)\Delta x + E_g(\Delta x)) \cdot |g'(x_0)\Delta x + E_g(\Delta x)|}{|\Delta x|}$

$=e_f(g'(x_0)\Delta x + E_g(\Delta x)) \cdot \frac{|g'(x_0)\Delta x + E_g(\Delta x)|}{|\Delta x|}$.


Since $\frac{|E_g(\Delta x)|}{|\Delta x|} \to 0$, certainly $E_g(\Delta x) \to 0$ as $\Delta x \to 0$. Also, since $|g'(x_0)\Delta x| \leq \Vert g'(x_0) \Vert_2 |\Delta x|$, we know that $g'(x_0)\Delta x \to 0$ as $\Delta x \to 0$. So $g'(x_0)\Delta x + E_g(\Delta x) \to 0$ as $\Delta x \to 0$, which means that $e_f(g'(x_0)\Delta x + E_g(\Delta x) ) \to 0$ as $\Delta x \to 0$.


$\frac{|g'(x_0)\Delta x + E_g(\Delta x) | }{|\Delta x|}  \leq \frac{|g'(x_0)\Delta x|}{|\Delta x|} + \frac{|E_g(\Delta x)|}{|\Delta x|}$

$\leq \frac{\Vert g'(x_0) \Vert_2 |\Delta x|}{|\Delta x|} + \frac{|E_g(\Delta x)|}{|\Delta x|}$

$= \Vert g'(x_0) \Vert_2 + \frac{|E_g(\Delta x)|}{|\Delta x|}$ .


This remains bounded as $\Delta x \to 0$, because $\frac{|E_g(\Delta x)|}{|\Delta x|} \to 0$.


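We have shown that the "second error term" is a product of one term that approaches $0$ and another term that remains bounded as $\Delta x \to 0$. Therefore, the "second error term" approaches $0$ as $\Delta x \to 0$. Since both error terms approach $0$, $\frac{|E_h(\Delta x)|}{|\Delta x|} \to 0$ as $\Delta x \to 0$, which completes the proof.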

See also