Numerical Analysis

Jonas Tölle1

Lecture notes for MS-C1650 Numerical Analysis, Aalto University

Last updated: 30.5.2024

Largely based on the lecture transcript by Harri Hakula, 2021.

Intended learning outcomes. After the course, the student will be able to…

Floating-point numbers

The set of real numbers R\mathbb{R} is infinite in two ways: it is unbounded and continuous. In most practical computing, the second kind of infiniteness is more consequential than the first kind, so we turn our attention there first.

Instead of R\mathbb{R}, we shall introduce the set of floating-point numbers (floats) F\mathbb{F}. They come with different bases, precisions and exponent ranges, and other features. The basic representation is
x=±(d0.d1d2dp)kke.x=\pm (d_0. d_1 d_2 \ldots d_p)_k\cdot k^e.
k\in\mathbb{N}\setminus\{1\} is called base or radix, p\in\mathbb{N}_0:=\mathbb{N}\cup\{0\} is called the precision, the d_i\in\{0,1,\ldots,k-1\}, i\in\{0,\ldots,p\}, are called digits, and the sequence of numbers
(d0.d1d2dp)k:=i=0pdiki(d_0. d_1 d_2 \ldots d_p)_k:=\sum_{i=0}^p d_i k^{-i}
is called mantissa or significand. The exponent eZe\in\mathbb{Z} is bounded meMm\le e\le M, where m,MZm,M\in\mathbb{Z}.
If k=10k=10, we can read the usual decimal commas from the mantissa:
(1.01)10=1100+0101+1102=1.01.(1.01)_{10}=1\cdot 10^0+0\cdot 10^{-1}+1\cdot 10^{-2}=1.01.
If k=2, we have binary floats. In the binary case, we observe that we can always choose d_0=1 (by adjusting the exponent e), hence saving one bit, as the leading digit need not be stored. We refer to this as normalization. The mantissa is always contained in the interval [1,2).

Example. (Toy floating point system). Binary floats of the type
(1.b1b2)2(1.b_1b_2)_2 with exponents e=1,0,1e=-1,0,1.
Hence (1.00)_2=1, (1.01)_2=\frac{5}{4}, (1.10)_2=\frac{3}{2}, and (1.11)_2=\frac{7}{4}. By multiplying with the exponents 2^{-1}=\frac{1}{2}, 2^0=1, 2^1=2, we get the whole set:

e=0\ (2^0=1):\quad 1,\ \frac{5}{4},\ \frac{3}{2},\ \frac{7}{4}
e=1\ (2^1=2):\quad 2,\ \frac{5}{2},\ 3,\ \frac{7}{2}
e=-1\ (2^{-1}=\frac{1}{2}):\quad \frac{1}{2},\ \frac{5}{8},\ \frac{3}{4},\ \frac{7}{8}

Important quantity: (1.01)21=14(1.01)_2-1=\frac{1}{4}, the so-called machine epsilon.

Define the machine epsilon by ε:=2p=(1.0001)21\varepsilon:=2^{-p}=(1.00\ldots 01)_2-1.

Rounding

For the rounding function \text{round}:\mathbb{R}\to\mathbb{F}, we have 5 alternative definitions (the rounding modes of IEEE 754): round to nearest, ties to even; round to nearest, ties away from zero; round toward 0; round toward +\infty; round toward -\infty.

It holds that round(x)=x(1+δ)\text{round}(x)=x(1+\delta), where δ<ε2|\delta|<\frac{\varepsilon}{2}, where ε\varepsilon denotes the machine epsilon. Note that usually δ\delta depends on xx.
There is a way to define the standard arithmetic operations on F\mathbb{F} such that
ab=round(a+b)=(a+b)(1+δ1),a\oplus b=\text{round}(a+b)=(a+b)(1+\delta_1),
ab=round(ab)=(ab)(1+δ2),a\ominus b=\text{round}(a-b)=(a-b)(1+\delta_2),
ab=round(ab)=ab(1+δ3),a\otimes b=\text{round}(ab)=ab(1+\delta_3),
ab=round(ab)=ab(1+δ4),b0.a\oslash b=\text{round}\left(\frac{a}{b}\right)=\frac{a}{b}(1+\delta_4),\quad b\not= 0.
Here, generally δiδj\delta_i\not=\delta_j, iji\not=j.

IEEE 754 “Double precision”

k=2, 64 bits, where 1 bit encodes the sign, 11 bits the exponent field, and 52 bits the mantissa b_1 b_2\ldots b_{52}:

| Exponent field | Number | Type of number |
| --- | --- | --- |
| 00\ldots 00=0 | \pm (0.b_1 b_2\ldots b_{52})_2 \cdot 2^{-1022} | 0 or subnormal |
| 00\ldots 01=1 | \pm (1.b_1 b_2\ldots b_{52})_2 \cdot 2^{-1022} | normalized |
| 00\ldots 10=2 | \pm (1.b_1 b_2\ldots b_{52})_2 \cdot 2^{-1021} | normalized |
| \ldots | \ldots | |
| 01\ldots 11=1023 | \pm (1.b_1 b_2\ldots b_{52})_2 \cdot 2^{0} | normalized |
| \ldots | \ldots | |
| 11\ldots 10=2046 | \pm (1.b_1 b_2\ldots b_{52})_2 \cdot 2^{1023} | normalized |
| 11\ldots 11=2047 | \pm \infty if b_1=b_2=\ldots=b_{52}=0, otherwise NaN | exception |

Thus, there are two zeros, two infinities and NaN which denotes “not a number”. The smallest positive normalized number is:
(1.0)2210222.210308.(1.0)_2\cdot 2^{-1022}\approx 2.2\cdot 10^{-308}.
The largest positive number is:
(1.11)2210231.810308(1.1\ldots 1)_2\cdot 2^{1023}\approx 1.8\cdot 10^{308}
The machine epsilon is:
2522.221016.2^{-52}\approx 2.22\cdot 10^{-16}.
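As a quick sanity check, here is a minimal Python sketch (Python and its sys module are used only for illustration, assuming a standard IEEE 754 binary64 platform) confirming these constants:

```python
import sys

# Machine epsilon for binary64: 2**-52
eps = 2.0 ** -52
print(eps == sys.float_info.epsilon)           # True
print(1.0 + eps > 1.0, 1.0 + eps / 2 == 1.0)   # eps is the gap between 1.0 and the next float

# Largest and smallest positive normalized doubles
print(sys.float_info.max)   # about 1.8e308
print(sys.float_info.min)   # about 2.2e-308, i.e. 2**-1022

# The two infinities and NaN
print(float("inf"), -float("inf"), float("inf") - float("inf"))  # inf -inf nan
```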

Here’s an easy-to-follow video explaining floating point numbers (and a specific version of Newton’s algorithm).

Condition number and stability

Conditioning of problems

Assume that f:\mathbb{R}\to\mathbb{R} is the “solution map” of the problem, and that the input numbers x, \hat{x} are close in value, e.g. \hat{x}=\text{round}(x). Set y:=f(x), \hat{y}:=f(\hat{x}).

Definition. The absolute condition number C(x)C(x) is defined by the relation
yy^C(x)xx^.|y-\hat{y}|\approx C(x)|x-\hat{x}|.
The relative condition number K(x)K(x) is defined by the relation
yy^yK(x)xx^x\left|\frac{y-\hat{y}}{y}\right|\approx K(x)\left|\frac{x-\hat{x}}{x}\right|

By the normalization, we guarantee that
(relative error in the output)K(x)×(relative error in the input).\text{(relative error in the output)}\approx K(x)\times \text{(relative error in the input)}.
Now,
yy^=f(x)f(x^)=f(x)f(x^)xx^undefinedf(x)  as  x^x(xx^)y-\hat{y}=f(x)-f(\hat{x})=\underbrace{\frac{f(x)-f(\hat{x})}{x-\hat{x}}}_{\to f'(x)\;\text{as}\;\hat{x}\to x}(x-\hat{x})
Thus, C(x)=f(x)C(x)=|f'(x)|.
Furthermore,
yy^y=f(x)f(x^)f(x)=f(x)f(x^)xx^undefinedf(x)  as  x^xxx^xxf(x)\frac{y-\hat{y}}{y}=\frac{f(x)-f(\hat{x})}{f(x)}=\underbrace{\frac{f(x)-f(\hat{x})}{x-\hat{x}}}_{\to f'(x)\;\text{as}\;\hat{x}\to x}\frac{x-\hat{x}}{x}\frac{x}{f(x)}
Thus, K(x)=xf(x)f(x)K(x)=\left|\frac{x f'(x)}{f(x)}\right|.

Example. f(x)=2xf(x)=2x, f(x)=2f'(x)=2. Thus, C(x)=2C(x)=2, K(x)=2x2x=1K(x)=\left|\frac{2x}{2x}\right|=1. This is a well-conditioned problem.

Example. g(x)=\sqrt{x}, g'(x)=\frac{1}{2\sqrt{x}}. Thus, C(x) becomes unbounded for x>0 close to zero, e.g. x\approx 10^{-8} yields C(x)\approx 10^4. On the other hand, K(x)=\left|\frac{x}{2\sqrt{x}\sqrt{x}}\right|=\frac{1}{2}.

Stability of algorithms

Definition. An algorithm or numerical process is called stable if small changes in the input produce small changes in the output. It is called unstable if large changes in the output are produced by small changes in the input.

An algorithm is stable, if every step is well-conditioned (i.e. has a uniformly bounded condition number). It is unstable if any step is ill-conditioned (i.e. the condition number may become arbitrarily large).

Forward error analysis (FEA) is asking:
“How far are we from the true solution?”

Backward error analysis (BEA) is asking:
“Given the answer, what was the problem?”

Example.
Set:
\text{fl}(x+y):=\text{round}(x)\oplus\text{round}(y)=(x(1+\delta_1)+y(1+\delta_2))(1+\delta_3),
where δi<ε2|\delta_i|<\frac{\varepsilon}{2}, i=1,2,3i=1,2,3.
FEA:
fl(x+y)=x+y+x(δ1+δ3+δ1δ3)+y(δ2+δ3+δ2δ3).\text{fl}(x+y)=x+y+x(\delta_1+\delta_3+\delta_1\delta_3)+y(\delta_2+\delta_3+\delta_2\delta_3).
The absolute error is
fl(x+y)(x+y)(x+y)(ε+ε24).|\text{fl}(x+y)-(x+y)|\le(|x|+|y|)\left(\varepsilon+\frac{\varepsilon^2}{4}\right).
The relative error is:
fl(x+y)(x+y)x+y(x+y)(ε+ε24)x+y.\left|\frac{\text{fl}(x+y)-(x+y)}{x+y}\right|\le\frac{(|x|+|y|)\left(\varepsilon+\frac{\varepsilon^2}{4}\right)}{|x+y|}.
BEA:
fl(x+y)=x(1+δ1)(1+δ3)+y(1+δ2)(1+δ3).\text{fl}(x+y)=x(1+\delta_1)(1+\delta_3)+y(1+\delta_2)(1+\delta_3).
Thus the relative error for each term is less than or equal to \varepsilon+\frac{\varepsilon^2}{4}.
Hence the sum of two floating point numbers is backward stable.

Well-conditioned problems may have unstable algorithms. For stability, every step has to be well-conditioned; even a single ill-conditioned step can make an algorithm unstable. Ill-conditioned problems cannot be reliably solved even with a stable algorithm.

Example. Consider evaluating f(x)=1+x1f(x)=\sqrt{1+x}-1 for xx close to zero. The relative condition number is:
K(x)=xf(x)f(x)=x21+x(1+x1)K(x)=\left|\frac{xf'(x)}{f(x)}\right|=\frac{x}{2\sqrt{1+x}(\sqrt{1+x}-1)}
=x(1+x+1)21+x(1+x1)(1+x+1)=1+x+121+x,=\frac{x(\sqrt{1+x}+1)}{2\sqrt{1+x}(\sqrt{1+x}-1)(\sqrt{1+x}+1)}=\frac{\sqrt{1+x}+1}{2\sqrt{1+x}},
and K(0)=1K(0)=1.
Consider the following 3 steps:

  1. t1:=1+xt_1:=1+x, well-conditioned, xx close to 00.
  2. t2:=t1t_2:=\sqrt{t_1}, relatively well-conditioned, also absolutely well conditioned, because t1t_1 is close to 11.
  3. t3:=t21t_3:=t_2-1, ill-conditioned, relative condition number of this step: K3(t2)=t2t21K_3(t_2)=\left|\frac{t_2}{t_2-1}\right|, which becomes unbounded for t2t_2 close to 11!
    On the other hand, the problem is well-conditioned. Solve it by writing:
    f(x)=1+x1=(1+x+1)(1+x1)1+x+1=x1+x+1,f(x)=\sqrt{1+x}-1=\frac{(\sqrt{1+x}+1)(\sqrt{1+x}-1)}{\sqrt{1+x}+1}=\frac{x}{\sqrt{1+x}+1},
    which can be evaluated directly close to zero.
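A minimal Python sketch (assuming IEEE 754 doubles; the function names are illustrative) comparing the unstable and the rewritten evaluation of f near zero:

```python
import math

def f_naive(x):
    # direct evaluation sqrt(1+x) - 1: the subtraction cancels leading digits for x near 0
    return math.sqrt(1.0 + x) - 1.0

def f_stable(x):
    # algebraically equivalent form x / (sqrt(1+x) + 1): no cancellation
    return x / (math.sqrt(1.0 + x) + 1.0)

for x in (1e-8, 1e-12, 1e-15):
    print(x, f_naive(x), f_stable(x))   # the stable version stays close to x/2
```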

Numerical differentiation

Recall Taylor’s theorem, for a twice differentiable function f:RRf:\mathbb{R}\to\mathbb{R},
f(x)=f(x0)+f(x0)(xx0)+12(xx0)2f(ξ),f(x)=f(x_0)+f'(x_0)(x-x_0)+\frac{1}{2}(x-x_0)^2 f''(\xi),
for any x0,xRx_0,x\in\mathbb{R}, where ξ[x0,x]\xi\in [x_0,x].

What is that \xi (= “xi”)? Under certain assumptions, elementary functions have series expansions. If the series is truncated, we obtain the Taylor polynomial. The remainder has an explicit expression, but because an intermediate value theorem is applied, the exact location of the point \xi is not known, i.e. it is “generic”.

By setting x:=z+hx:=z+h, x0:=zx_0:=z, we obtain the useful equivalent formula
f(z+h)=f(z)+f(z)h+12f(ξ)h2,f(z+h)=f(z)+f'(z)h+\frac{1}{2} f''(\xi)h^2,
for every z,hRz,h\in\mathbb{R}, ξ[z,z+h]\xi\in [z,z+h].

Rate of convergence (QQ-convergence)

Let (x_k) be an infinite sequence of real numbers. Let s_k:=\sup_{l\ge k}x_l, k\in\mathbb{N}, be the supremum (i.e., the least upper bound) of the tail (that is, large indices l\ge k) of (x_k). Define the \limsup (limes superior) as
lim supkxk:=limksk[,+].\limsup_{k\to\infty}x_k:=\lim_{k\to\infty}s_k\in[-\infty,+\infty].
Unlike a limit, it always exists, but it can be \pm\infty. If (x_k) is bounded, the \limsup is the largest limit of a converging subsequence. If \lim_{k\to\infty} x_k\in (-\infty,\infty) exists, then \lim_{k\to\infty}x_k=\limsup_{k\to\infty}x_k. The converse is not true.

Examples. (1) xk:=(1)kx_k:=(-1)^k, then sk=1s_k=1, and lim supkxk=1\limsup_{k\to\infty}x_k=1.
(2) xk=sin(k)x_k=\sin(k), then sk=1s_k=1, and lim supkxk=1\limsup_{k\to\infty}x_k=1.

Assume that limkxk=x\lim_{k\to\infty}x_k=x and that there exists some large index MNM\in\mathbb{N} such that xkxx_k\not=x for all kMk\ge M. Then we define the following quantity for p0p\ge 0
C(p):=lim supkxk+1xxkxp.C(p):=\limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^p}.
We observe that C(p)<C(p^{*})<\infty for some p>0p^{*}> 0 implies C(p)=0C(p)=0 for every 0p<p0\le p<p^{*}. If C(p)>0C(p^{*})>0 for some p>0p^{*}> 0 then C(p)=C(p)=\infty for any p>pp>p^{*}.

Proof. By the submultiplicative property of lim sup\limsup,
C(p)=lim supkxk+1xxkxp=lim supk[xk+1xxkxpxkxpp]C(p)=\limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^p}=\limsup_{k\to\infty}\left[\frac{|x_{k+1}-x|}{|x_k-x|^{p^{*}}}|x_k-x|^{p^{*}-p}\right]
(lim supkxk+1xxkxp)(lim supkxkxpp)=C(p){0if    p<p,if    p>p.\le\left(\limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^{p^{*}}}\right)\cdot\left(\limsup_{k\to\infty}|x_k-x|^{p^{*}-p}\right)=C(p^{*})\cdot \begin{cases}0&\text{if}\;\;p<p^{*},\\\infty&\text{if}\;\;p>p^{*}.\end{cases}
\Box

Thus, there exists a (possibly infinite) pp^{*} such that
C(p)={0if    0p<p,C(p)if    p=p,if    p>p.C(p)=\begin{cases}0&\text{if}\;\;0\le p<p^{*},\\ C(p^{*})&\text{if}\;\;p=p^{*},\\\infty &\text{if}\;\;p>p^{*}.\end{cases}
The number p^{*} is called the order of convergence of the sequence (x_k) and determines the rate of convergence as follows: for p^{*}=1 with 0<C(1)<1 the convergence is called linear, for p^{*}=1 with C(1)=0 (or p^{*}>1) superlinear, for p^{*}=2 quadratic, and for p^{*}=3 cubic.

When working with convergence estimates it is often useful to use the following approximation:
xk+1xCxkxp|x_{k+1}-x|\approx C|x_k-x|^{p^{*}}
for some constant C>0C>0, not necessarily C(p)C(p^{*}).
Here, it is useful to look at the logarithmic behavior:
log(xk+1x)log(Cxkxp)=log(C)+log(xkxp)=log(C)+plog(xkx).\log(|x_{k+1}-x|)\approx\log\left(C|x_k-x|^{p^{*}}\right)=\log(C)+\log\left(|x_k-x|^{p^{*}}\right)=\log(C)+p^{*}\log(|x_k-x|).
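A small Python sketch (the helper name estimated_order is illustrative, not from the notes) that estimates p^{*} from consecutive errors using exactly this logarithmic relation:

```python
import math

def estimated_order(errors):
    """Estimate p* from consecutive errors e_k = |x_k - x| via
    log e_{k+1} - log e_k being roughly p* (log e_k - log e_{k-1})."""
    return [math.log(errors[k + 1] / errors[k]) / math.log(errors[k] / errors[k - 1])
            for k in range(1, len(errors) - 1)]

# errors of a quadratically convergent iteration (e_{k+1} behaves like e_k^2)
errors = [1e-1, 1e-2, 1e-4, 1e-8, 1e-16]
print(estimated_order(errors))   # values close to 2
```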

The rate of convergence can be used interchangeably with the order of convergence. However, there is some caution necessary, as different authors use different terminology here. Usually, the order of convergence always refers to the same thing, namely, the pp^{*}-exponent in the denominator of the limit defining the order of convergence. Most confusingly, some authors call the order of convergence “rate of convergence”, as e.g. here. The English Wikipedia article calls it the order of convergence, whereas here the rate of convergence is the constant in the definition, which also determines the speed of convergence, together with the order of convergence. So, please always check the context, as the use of the terminology should be clear from it. If there is no definition, try to figure out what is meant in each text. As a rule of thumb: The “order of convergence” is a unique terminology in numerical analysis. The “rate of convergence” can mean at least two different things. I will use both words for the same thing, but will try to make clear what I mean from case to case. In any case, to be sure, use “order of convergence”. My PhD advisor usually said that in mathematics “it’s all hollow words” (meaning that one should check the definition).

Landau’s little oo- and big OO-notation

Copied from Wikibooks under a Creative Commons BY-SA 4.0 license.

The Landau notation is an amazing tool applicable in all of real analysis. The reason it is so convenient and widely used is because it underlines a key principle of real analysis, namely ‘‘estimation’’. Loosely speaking, the Landau notation introduces two operators which can be called the “order of magnitude” operators, which essentially compare the magnitude of two given functions.

The “little-oo

The “little-oo” provides a function that is of lower order of magnitude than a given function, that is the function o(g(x))o(g(x)) is of a lower order than the function g(x)g(x). Formally,

Definition.
Let ARA\subseteq\mathbb{R} and let cRc\in\mathbb{R}.
Let f,g:ARf,g:A\to\mathbb{R}.

If limxcf(x)g(x)=0\lim_{x\to c}\frac{f(x)}{g(x)}=0 then we say that

"As xcx\to c, f(x)=o(g(x))f(x)=o(g(x))"

Examples.

The “Big-OO

The “Big-OO” provides a function that is at most the same order as that of a given function, that is the function O(g(x))O(g(x)) is at most the same order as the function g(x)g(x). Formally,

Definition.
Let ARA\subseteq\mathbb{R} and let cRc\in\mathbb{R}

Let f,g:ARf,g:A\to\mathbb{R}

If there exists M>0M>0 such that limxcf(x)g(x)<M\lim_{x\to c}\left| \frac{f(x)}{g(x)}\right| <M then we say that

"As xcx\to c, f(x)=O(g(x))f(x)=O(g(x))"

Examples.

Applications

We will now consider a few examples which demonstrate the power of this notation.

Differentiability

Let f:URRf: \mathcal{U} \subseteq \mathbb{R} \to\mathbb{R} and x0Ux_0 \in \mathcal{U}.

Then ff is differentiable at x0x_0 if and only if

There exists a λR\lambda \in\mathbb{R} such that as xx0x\to x_0, f(x)=f(x0)+λ(xx0)+o(xx0)f(x) = f(x_0) + \lambda(x-x_0)+o\left( |x-x_0|\right).

Mean Value Theorem

Let f:[a,b]\to\mathbb{R} be differentiable on [a,b]. Then,

As xax\to a, f(x)=f(a)+O(xa)f(x)=f(a)+O(x-a).

Taylor’s Theorem

Let f:[a,b]\to\mathbb{R} be n-times differentiable on [a,b]. Then,

As xax\to a, f(x)=f(a)+(xa)f(a)1!+(xa)2f(a)2!++(xa)n1f(n1)(a)(n1)!+O((xa)n)f(x)=f(a)+\tfrac{(x-a)f'(a)}{1!}+\tfrac{(x-a)^2f''(a)}{2!}+\ldots+\tfrac{(x-a)^{n-1}f^{(n-1)}(a)}{(n-1)!}+O\left( (x-a)^n\right).

Finding roots of functions and fixed points

Let f:RRf:\mathbb{R}\to\mathbb{R} be continuous. We are interested in methods for finding zeros, that is, roots of ff, in other words, xRx\in\mathbb{R}, such that f(x)=0f(x)=0.

Definition. If f:RnRf:\mathbb{R}^n\to\mathbb{R}, nNn\in\mathbb{N} is kk-times continuously differentiable, then we write fCk(Rn)f\in C^k(\mathbb{R}^n) or just fCkf\in C^k. For k=0k=0, fC0f\in C^0 or fC(Rn)f\in C(\mathbb{R}^n) or fC([a,b])f\in C([a,b]) just means that ff is assumed to be continuous.

Bisection method

The intermediate value theorem for continuous functions implies that some x_1<x<x_2 with f(x)=0 exists if f(x_1) f(x_2)<0, i.e., there is a sign change. The bisection method is based on halving the interval such that the sign condition is preserved. Note that, in principle, we first have to find an interval [x_1,x_2] with such a sign change.

Let us analyze the convergence rate. Let [a,b][a,b] be an interval. After kk steps the interval of analysis has length ba2k\frac{b-a}{2^k} which converges to zero for kk\to\infty. Let us look in a neighborhood of radius δ>0\delta>0, so that
ba2k2δ2k+1baδklog2(baδ)1.\frac{b-a}{2^k}\le 2\delta\quad\Leftrightarrow\quad 2^{k+1}\ge \frac{b-a}{\delta}\quad\Leftrightarrow\quad k\ge \log_2\left(\frac{b-a}{\delta}\right)-1.
Every step reduces the error by a factor of \frac{1}{2}. The convergence rate is thus linear.
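A minimal Python sketch of the bisection method as described above (names are illustrative):

```python
def bisect(f, a, b, tol=1e-10):
    """Bisection: assumes f is continuous with a sign change f(a)*f(b) < 0."""
    fa = f(a)
    if fa * f(b) >= 0:
        raise ValueError("need a sign change on [a, b]")
    while (b - a) / 2 > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:      # root lies in [a, m]
            b = m
        else:                # root lies in [m, b]
            a, fa = m, fm
    return (a + b) / 2

print(bisect(lambda x: x**2 - 2, 0.0, 2.0))   # approx sqrt(2) = 1.4142135...
```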

Newton’s method

Assume that fC1f\in C^1. For an initial value x0x_0, consider the iteration
xk+1=xkf(xk)f(xk)k=0,1,.x_{k+1}=x_k-\frac{f(x_k)}{f'(x_k)}\quad k=0,1,\ldots.

Heuristics. If f(x_\ast)=0 and f\in C^2, by Taylor's expansion,
0=f(x)=f(x0)+(xx0)f(x0)+(xx0)22f(ξ)0=f(x_\ast)=f(x_0)+(x_\ast-x_0)f'(x_0)+\frac{(x_\ast-x_0)^2}{2}f''(\xi)
for ξ[x0,x]\xi\in [x_0,x_\ast], and upon neglecting the 2nd order term,
xx0f(x0)f(x0).x_\ast \approx x_0-\frac{f(x_0)}{f'(x_0)}.

Theorem. If fC2f\in C^2 and x0x_0 is sufficiently good (i.e. close to the root xx_\ast) and if f(x)0f'(x_\ast)\not=0, then Newton’s method converges quadratically.

Proof. By Taylor’s expansion, it follows that
x=xkf(xk)f(xk)(xxk)22f(ξk)f(xk)x_\ast=x_k-\frac{f(x_k)}{f'(x_k)}-\frac{(x_\ast-x_k)^2}{2}\frac{f''(\xi_k)}{f'(x_k)}
for some ξk[x,xk]\xi_k\in [x_\ast,x_k]. Take xk+1x_{k+1} from the method and subtract,
xk+1x=(xxk)2f(ξk)2f(xk)undefinedD.x_{k+1}-x_\ast=(x_\ast-x_k)^2\underbrace{\frac{f''(\xi_k)}{2f'(x_k)}}_{\le D}.
In other words,
|x_{k+1}-x_\ast|\le D |x_k-x_\ast|^2,
and thus x_k\to x_\ast as k\to\infty, provided x_0 is sufficiently close to x_\ast.
Hence the method is quadratic. Note that f(xk)f'(x_k) does not vanish by continuity if xkx_k is close to xx_\ast. \Box

What happens if f(x)=0f'(x_\ast)=0?
xk+1x=(xxk)2f(ξk)2f(xk)undefined0x_{k+1}-x_\ast=(x_\ast-x_k)^2\frac{f''(\xi_k)}{2\underbrace{f'(x_k)}_{\to0}}
as kk\to\infty.
By Taylor’s expansion (η\eta=“eta”):
f(xk)=f(x)undefined=0+(xkx)f(ηk)=(xkx)f(ηk)f'(x_k)=\underbrace{f'(x_\ast)}_{=0}+(x_k-x_\ast)f''(\eta_k)=(x_k-x_\ast)f''(\eta_k)
for some ηk[x,xk]\eta_k\in [x_\ast,x_k], and hence
x_{k+1}-x_\ast=(x_\ast-x_k)^2\frac{f''(\xi_k)}{2(x_k-x_\ast)f''(\eta_k)}=(x_k-x_\ast)\frac{f''(\xi_k)}{2f''(\eta_k)}.
The method has degenerated to a linear method!

Example. f(x)=x2f(x)=x^2, f(x)=2xf'(x)=2x. Newton:
xk+1=xkxk22xk=12xk.x_{k+1}=x_k-\frac{x^2_k}{2x_k}=\frac{1}{2}x_k.
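A minimal Python sketch of Newton's method, illustrating both the quadratic case and the degenerate (linear) case of the example above; the test functions are chosen only for illustration:

```python
def newton(f, df, x0, steps=6):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - f(xs[-1]) / df(xs[-1]))
    return xs

# simple root of x^2 - 2: quadratic convergence to sqrt(2)
print(newton(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))

# double root of x^2 at 0 (f'(x*) = 0): the error is only halved per step
print(newton(lambda x: x**2, lambda x: 2 * x, 1.0))   # 1, 0.5, 0.25, ...
```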

Secant method

Sometimes it can be difficult or computationally expensive to compute the derivative f'(x_k). Newton's method can be adapted by approximating the derivative by a difference quotient. The secant method is the following two-step recursive algorithm.
xk+1=xkf(xk)(xkxk1)f(xk)f(xk1),k=1,2,x_{k+1}=x_k-\frac{f(x_k)(x_k-x_{k-1})}{f(x_k)-f(x_{k-1})},\quad k=1,2,\ldots
with two distinct starting points x0x1x_0\not=x_1. The convergence rate is 1+521.62\frac{1+\sqrt{5}}{2}\approx 1.62, the golden ratio.
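A minimal Python sketch of the secant iteration (with a small guard against division by zero once the iterates coincide; names are illustrative):

```python
def secant(f, x0, x1, steps=10):
    """Newton's method with f'(x_k) replaced by the difference quotient
    through the last two iterates."""
    for _ in range(steps):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:          # iterates (numerically) coincide; stop early
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    return x1

print(secant(lambda x: x**2 - 2, 1.0, 2.0))   # approx 1.4142135...
```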

Fixed point iteration

Definition. A point xRx\in\mathbb{R} is called a fixed point of φ:RR\varphi:\mathbb{R}\to\mathbb{R} if φ(x)=x\varphi(x)=x.

We could for instance use Newton’s method to find fixed points by setting f(x):=φ(x)xf(x):=\varphi(x)-x.

Banach’s Fixed Point Theorem. Suppose that φ\varphi is a contraction, that is, there exists a constant L<1L<1 such that
φ(x)φ(y)Lxy|\varphi(x)-\varphi(y)|\le L|x-y|
for all x,yRx,y\in\mathbb{R}. Then there exists a unique fixed point xRx_\ast\in\mathbb{R} of φ\varphi, i.e., φ(x)=x\varphi(x_\ast)=x_\ast, and the fixed point iteration φn:=φφundefinedn-times\varphi_n:=\underbrace{\varphi\circ\ldots\circ\varphi}_{n\text{-times}} satisfies limnφn(x0)=x\lim_{n\to\infty}\varphi_n(x_0)=x_{\ast} for any starting point x0Rx_0\in\mathbb{R}. The convergence rate is at least linear.

Proof. We prove that the sequence {φk(x0)}k=0={xk}k=0\{\varphi_k(x_0)\}_{k=0}^{\infty}=\{x_k\}_{k=0}^{\infty} is a Cauchy sequence. Let k>jk>j. Then, by the triangle inequality,
xkxjxkxk1+xk1xk2++xj+1xjundefined(kj)-summands.|x_k-x_j|\le\underbrace{|x_k-x_{k-1}|+|x_{k-1}-x_{k-2}|+\ldots+|x_{j+1}-x_j|}_{(k-j)\text{-summands}}.
Furthermore,
xmxm1=φ(xm1)φ(xm2)Lxm1xm2Lm1x1x0.|x_m-x_{m-1}|=|\varphi(x_{m-1})-\varphi(x_{m-2})|\le L|x_{m-1}-x_{m-2}|\le L^{m-1}|x_1-x_0|.
Hence, by the geometric series,
xkxjLj1Lkj1Lx1x0.|x_k-x_j|\le L^j\frac{1-L^{k-j}}{1-L}|x_1-x_0|.
If k>Nk>N, j>Nj>N, then
xkxjLN11Lx1x00|x_k-x_j|\le L^N\frac{1}{1-L}|x_1-x_0|\to 0
as NN\to\infty, which proves that {xk}\{x_k\} is a Cauchy sequence. The linear convergence rate follows also from this estimate.
The existence of a fixed point follows from the continuity of φ\varphi (as contractions are uniformly continuous, in fact, even Lipschitz continuous) as follows.
x=limkxk=limkxk+1=limkφ(xk)=φ(limkxk)=φ(x).x_\ast=\lim_{k\to\infty}x_k=\lim_{k\to\infty}x_{k+1}=\lim_{k\to\infty}\varphi(x_k)=\varphi(\lim_{k\to\infty}x_k)=\varphi(x_\ast).
\Box

Theorem. Assume that \varphi\in C^p for p\ge 1. Furthermore, assume that \varphi has a fixed point x_\ast and assume that
\varphi'(x_\ast)=\varphi''(x_\ast)=\ldots=\varphi^{(p-1)}(x_\ast)=0 for p\ge 2 and
|\varphi'(x_\ast)|<1 if p=1. Then the fixed point sequence \{\varphi_k(x_0)\} converges to x_\ast at least with rate p, provided that the starting point x_0 is sufficiently close to x_\ast. If, in addition, \varphi^{(p)}(x_\ast)\not=0, then the rate of convergence is precisely p.

Proof. First note that by Banach’s fixed point theorem the limit indeed converges to xx_\ast for suitable starting points x0x_0. By the Taylor expansion,
x_{k+1}-x_\ast=\varphi(x_k)-\varphi(x_\ast)=\sum_{l=1}^{p-1}\frac{\varphi^{(l)}(x_\ast)}{l!}(x_k-x_\ast)^l+\frac{\varphi^{(p)}(\xi_k)}{p!}(x_k-x_\ast)^p
for some ξk\xi_k between xx_\ast and xkx_k. The sum will be left empty for the case p=1p=1. Since φ(l)(x)=0\varphi^{(l)}(x_\ast)=0 for 1lp11\le l\le p-1, we get that
xk+1x=φ(p)(ξk)p!xkxp|x_{k+1}-x_\ast|=\frac{|\varphi^{(p)}(\xi_k)|}{p!} |x_k-x_\ast|^p.
By continuity, there exists C>0C>0 (with C<1C<1 for p=1p=1) such that
φ(p)(ξk)p!C\frac{|\varphi^{(p)}(\xi_k)|}{p!}\le C for ξk\xi_k sufficiently close to xx_\ast (that is, for sufficiently large kk). Thus,
xk+1xCxkxp|x_{k+1}-x_\ast|\le C|x_k-x_\ast|^p
for large kk, and thus the rate of convergence is at least pp. Note that for p=1p=1, this also proves convergence by
|x_{k+1}-x_\ast|<\underbrace{|\varphi'(\xi_k)|}_{<1}|x_k-x_\ast|.
If φ(p)0\varphi^{(p)}\not=0, then by continuity, there exists K>0K>0 such that
φ(p)(ξk)p!K\frac{|\varphi^{(p)}(\xi_k)|}{p!}\ge K
for \xi_k sufficiently close to x_\ast. Thus
xk+1xKxkxp|x_{k+1}-x_\ast|\ge K|x_k-x_\ast|^p
which implies that the rate of convergence cannot be higher than pp. Thus the rate of convergence is precisely pp. \Box

Note. From the above proof, we expect that close to the fixed point xx_\ast
xk+1xφ(p)(x)p!xkxp,|x_{k+1}-x_\ast|\approx \frac{|\varphi^{(p)}(x_\ast)|}{p!} |x_k-x_\ast|^p,
when
φ(x)=φ(x)==φ(p1)(x)=0,\varphi'(x_\ast)=\varphi''(x_\ast)=\ldots=\varphi^{(p-1)}(x_\ast)=0,
but φ(p)(x)0\varphi^{(p)}(x_\ast)\not=0.
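A minimal Python sketch of the fixed point iteration; \varphi(x)=\cos(x) is used only as an illustrative contraction near its fixed point:

```python
import math

def fixed_point(phi, x0, steps=100, tol=1e-12):
    """Iterate x_{k+1} = phi(x_k) until successive iterates are within tol."""
    x = x0
    for _ in range(steps):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# |phi'(x*)| = |sin(x*)| < 1, so the iteration converges (linearly)
print(fixed_point(math.cos, 1.0))   # approx 0.7390851...
```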

Polynomial interpolation

Idea. Approximate a function f:RRf:\mathbb{R}\to\mathbb{R} over [a,b][a,b] by a polynomial pp such that in distinct data points (xi,yi)(x_i,y_i), i=0,1,,ni=0,1,\ldots, n, the approximation is exact, that is,
f(xi)=yi=p(xi),for alli=0,1,,n.f(x_i)=y_i=p(x_i),\quad\text{for all}\quad i=0,1,\ldots, n.
We may call xix_i node and yiy_i value.
We need at least 22 data points. We usually just assume that xixjx_i\not=x_j for iji\not=j.

Note. Interpolation polynomials are not per se unique, for instance the data {(1,1),(1,1)}\{(-1,1),(1,1)\} can be interpolated by
p(x)=1p(x)=1, q(x)=x2q(x)=x^2, or r(x)=x4x2+1r(x)=x^4-x^2+1. However, we will see later that pp is the unique interpolation polynomial with degp1=n\deg p\le 1= n.

Example. (1,2)(1,2), (2,3)(2,3), (3,6)(3,6), as data set {(xi,yi)   ⁣:  i=0,1,2}\{(x_i,y_i)\;\colon\;i=0,1,2\} on the interval [1,3][1,3].
We are looking for a polynomial p2(x)=j=02cjxjp_2(x)=\sum_{j=0}^2 c_j x^j, which is chosen to be 2nd order, because we have 33 data points and 33 unknown coefficients.
We can formulate the problem in matrix form:
(1x0x021x1x121x2x22)(c0c1c2)=(y0y1y2),\begin{pmatrix}1 & x_0 & x_0^2\\ 1 & x_1 & x_1^2\\ 1 &x_2 & x_2^2\end{pmatrix}\cdot \begin{pmatrix}c_0\\c_1\\c_2\end{pmatrix}=\begin{pmatrix}y_0\\y_1\\y_2\end{pmatrix},
which is a so-called Vandermonde matrix which has determinant
det(1x0x021x1x121x2x22)=i<j(xjxi)0,\det\begin{pmatrix}1 & x_0 & x_0^2\\ 1 & x_1 & x_1^2\\ 1 &x_2 & x_2^2\end{pmatrix}=\prod_{i<j}(x_j-x_i)\not=0, and is thus invertible. Here,
(111124139)(c0c1c2)=(236).\begin{pmatrix}1 & 1 & 1\\ 1 & 2 & 4\\ 1 &3 & 9\end{pmatrix}\cdot \begin{pmatrix}c_0\\c_1\\c_2\end{pmatrix}=\begin{pmatrix}2\\3\\6\end{pmatrix}.
As a result, c0=3c_0=3, c1=2c_1=-2, and c2=1c_2=1, and thus,
p2(x)=x22x+3.p_2(x)=x^2-2x+3.

The computational complexity of solving the linear system is O(n3)O(n^3). We used the natural basis for the polynomials.
What would be the ideal basis?
Definition. (Lagrange basis polynomials) Suppose that xixjx_i\not=x_j if iji\not=j. We call
ϕi(x):=j=0ijnxxjxixj\phi_i(x):=\prod_{\substack{j=0\\ i\not=j}}^n\frac{x-x_j}{x_i-x_j}
the iith Lagrange basis polynomial.
The Lagrange interpolation polynomial is given by
p(x):=i=0nyiφi(x).p(x):=\sum_{i=0}^n y_i \varphi_i(x).

Clearly,
φi(xj)=δi,j:={1    if    i=j,0    if    ij.\varphi_i(x_j)=\delta_{i,j}:=\begin{cases}1\text{\;\;if\;\;}i=j,\\0\text{\;\;if\;\;}i\not=j.\end{cases}

Example. (1,2)(1,2), (2,3)(2,3), (3,6)(3,6):
φ0(x)=(x2)(x3)(12)(13)\varphi_0(x)=\frac{(x-2)(x-3)}{(1-2)(1-3)}
φ1(x)=(x1)(x3)(21)(23)\varphi_1(x)=\frac{(x-1)(x-3)}{(2-1)(2-3)}
φ2(x)=(x1)(x2)(31)(32)\varphi_2(x)=\frac{(x-1)(x-2)}{(3-1)(3-2)}
p2(x)=2φ0(x)+3φ1(x)+6φ2(x)=x22x+3.p_2(x)=2\varphi_0(x)+3\varphi_1(x)+6\varphi_2(x)=x^2-2x+3.

Evaluating the Lagrange polynomials has the computational complexity O(n2)O(n^2).
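A minimal Python sketch evaluating the Lagrange interpolation polynomial directly from the basis polynomials, checked on the data set of the example (names are illustrative):

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)   # i-th Lagrange basis polynomial at x
        total += yi * basis
    return total

# data (1,2), (2,3), (3,6): the interpolant is x^2 - 2x + 3
print(lagrange_eval([1, 2, 3], [2, 3, 6], 2.5))   # 4.25 = 2.5^2 - 2*2.5 + 3
```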

Newton’s interpolation

Idea. Extend the natural basis:
1,xx0,(xx0)(xx1),,j=0n1(xxj).1,\quad x-x_0,\quad (x-x_0)(x-x_1),\quad\ldots,\quad \prod_{j=0}^{n-1}(x-x_j).

Definition. Define Newton’s interpolation polynomials by
pn(x)=a0+a1(xx0)++anj=0n1(xxj)p_n(x)=a_0+a_1(x-x_0)+\ldots+a_n\prod_{j=0}^{n-1} (x-x_j)
in such a way that pn(xi)=yip_n(x_i)=y_i.
Clearly,
p(x0)=y0a0=y0,p(x_0)=y_0\quad\Rightarrow\quad a_0=y_0,
and
p(x1)=a0+a1(x1x0)=y1a1=y1a0x1x0.p(x_1)=a_0+a_1(x_1-x_0)=y_1\quad \Rightarrow \quad a_1=\frac{y_1-a_0}{x_1-x_0}.
More generally, we have the lower triangular linear system
\begin{pmatrix} 1& 0 & 0 &\cdots& 0\\ 1& x_1-x_0 & 0 &\cdots& 0\\ 1& x_2-x_0 & (x_2-x_0)(x_2-x_1)&\cdots& 0\\ \vdots &\vdots & \vdots&\ddots&\vdots\\ 1& x_n-x_0 & (x_n-x_0)(x_n-x_1)&\cdots&\prod_{j=0}^{n-1}(x_n-x_j) \end{pmatrix}\begin{pmatrix}a_0\\a_1\\a_2\\\vdots\\a_n\end{pmatrix}=\begin{pmatrix}y_0\\y_1\\y_2\\\vdots\\y_n\end{pmatrix}.

Example. p2(x)=a0+a1(x1)+a2(x1)(x2)p_2(x)=a_0+a_1(x-1)+a_2(x-1)(x-2) with the system
(100110122)(a0a1a2)=(236)\begin{pmatrix}1&0&0\\ 1& 1&0\\ 1&2&2\end{pmatrix}\cdot\begin{pmatrix}a_0\\a_1\\a_2\end{pmatrix}=\begin{pmatrix}2\\3\\6\end{pmatrix}
and hence a0=2a_0=2, a1=1a_1=1, and a2=1a_2=1 which yields p2(x)=x22x+3p_2(x)=x^2-2x+3.

Uniqueness

Theorem. Interpolation polynomials with n+1n+1 nodes xix_i, i=0,1,,ni=0,1,\ldots, n are unique in the class of polynomials qq with degqn\deg q\le n.

Proof. (Idea). A non-zero polynomial of degree at most n has at most n roots. Let p_n and q_n be two interpolating polynomials for the same set of data. Then
p_n(x_j)-q_n(x_j)=0,\quad\text{for any}\quad j=0,1,\ldots,n.
Hence pnqnp_n-q_n has n+1n+1 distinct roots. As deg(pnqn)max(degpn,degqn)=n\deg(p_n-q_n)\le \max(\deg p_n,\deg q_n)=n, the only polynomial with n+1n+1 roots is the polynomial which is constantly zero. Hence,
pn=qn.p_n=q_n.
We have used the corollary to the fundamental theorem of algebra which states that every non-constant real polynomial of degree mm has at most mm zeros. \Box

Divided differences

Let pp be a Newton interpolation polynomial
p(x)=a_0+a_1(x-x_0)+a_2(x-x_0)(x-x_1)+\ldots+a_n\prod_{j=0}^{n-1}(x-x_j).

Definition. The divided difference of order kk, denoted by f[x0,x1,,xk]f[x_0,x_1,\ldots,x_k], is defined as the aka_k-coefficient of the Newton interpolation polynomial with data yi=f(xi)y_i=f(x_i), in other words,
f[x0,x1,,xk]:=ak.f[x_0,x_1,\ldots,x_k]:=a_k.

Theorem.
f[x0,x1,,xk]=f[x1,,xk]f[x0,,xk1]xkx0.f[x_0,x_1,\ldots,x_k]=\frac{f[x_1,\ldots,x_k]-f[x_0,\ldots,x_{k-1}]}{x_k-x_0}.

Note. The recursion terminates because f[xi]=yif[x_i]=y_i.

Example. (1,2), (2,3), (3,6), p_2(x)=x^2-2x+3, Newton: a_0=2, a_1=1, a_2=1.
f[x0]=2=a0f[x_0]=2=a_0, f[x1]=3f[x_1]=3, f[x2]=6f[x_2]=6, f[x0,x1]=3221=1=a1f[x_0,x_1]=\frac{3-2}{2-1}=1=a_1, f[x1,x2]=6332=3f[x_1,x_2]=\frac{6-3}{3-2}=3, f[x0,x1,x2]=3131=1=a2f[x_0,x_1,x_2]=\frac{3-1}{3-1}=1=a_2.
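A minimal Python sketch computing the divided differences by the recursion above and evaluating the Newton form (the in-place table is one standard implementation; names are illustrative):

```python
def divided_differences(xs, ys):
    """Return the coefficients a_k = f[x_0,...,x_k] of the Newton form."""
    n = len(xs)
    table = list(ys)             # order-0 divided differences f[x_i]
    coeffs = [table[0]]
    for k in range(1, n):
        for i in range(n - k):   # overwrite with the order-k differences
            table[i] = (table[i + 1] - table[i]) / (xs[i + k] - xs[i])
        coeffs.append(table[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    # Horner-like evaluation of the Newton form
    result = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[k]) + coeffs[k]
    return result

xs, ys = [1, 2, 3], [2, 3, 6]
print(divided_differences(xs, ys))                         # a_0=2, a_1=1, a_2=1
print(newton_eval(xs, divided_differences(xs, ys), 2.5))   # 4.25
```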

Why does this work?

One point: f[xj]=fj=yjf[x_j]=f_j=y_j.
Two points: f[xi,xj]=f[xj]f[xi]xjxif[x_i,x_j]=\frac{f[x_j]-f[x_i]}{x_j-x_i} which is the line spanned by the two points (xi,yi)(x_i,y_i) and (xj,yj)(x_j,y_j), i.e.,
yyi=yjyixjxi(xxi).y-y_i=\frac{y_j-y_i}{x_j-x_i}(x-x_i).

Proof. (Idea). We have three interpolation polynomials pp, qq, rr, where degp=k\deg p=k, degq=degr=k1\deg q=\deg r=k-1. pp interpolates at x0,x1,,xkx_0,x_1,\ldots,x_k, qq interpolates at x0,x1,,xk1x_0,x_1,\ldots,x_{k-1}, and rr interpolates at x1,,xkx_1,\ldots,x_k.
Claim.
p(x)=q(x)+xx0xkx0(r(x)q(x))undefined=0  for  xi.p(x)=q(x)+\frac{x-x_0}{x_k-x_0}\underbrace{(r(x)-q(x))}_{=0\text{\;for\;}x_i}.
x0x_0: p(x0)=q(x0)=f0p(x_0)=q(x_0)=f_0.
x1,,xk1x_1,\ldots,x_{k-1}: p(xi)=q(xi)p(x_i)=q(x_i).
x_k: p(x_k)=q(x_k)+\underbrace{\frac{x_k-x_0}{x_k-x_0}}_{=1}(r(x_k)-q(x_k))=r(x_k).
The highest order term has the coefficient
p(k)(x)k!=rk1qk1xkx0,\frac{p^{(k)}(x)}{k!}=\frac{r_{k-1}-q_{k-1}}{x_k-x_0},
where r_{k-1}=f[x_1,x_2,\ldots,x_k]=\frac{r^{(k-1)}}{(k-1)!} and q_{k-1}=f[x_0,x_1,\ldots,x_{k-1}]=\frac{q^{(k-1)}}{(k-1)!},
which can be proved by the general Leibniz rule. \Box

Interpolation error

Assume that fCn+1f\in C^{n+1}. We are interested in the local (pointwise) error (residual)
R(x):=f(x)p(x),R(x):=f(x)-p(x),
where pp is the interpolation polynomial with degp=n\deg p=n.
Fix data (x_i,y_i), y_i=f(x_i), i=0,1,\ldots, n, x_i\not=x_j, i\not=j. Let x' be a distinct extra point.
Define an auxiliary function:
h(x)=f(x)p(x)cw(x),h(x)=f(x)-p(x)-cw(x),
where
w(x)=j=0n(xxj)w(x)=\prod_{j=0}^n(x-x_j)
and
c=f(x)p(x)w(x).c=\frac{f(x')-p(x')}{w(x')}.
We find that
h(xi)=0fori=0,1,n.h(x_i)=0\quad \text{for}\quad i=0,1,\ldots n.
Furthermore,
h(x)=f(x)p(x)f(x)p(x)w(x)w(x)=0.h(x')=f(x')-p(x')-\frac{f(x')-p(x')}{w(x')}w(x')=0.
Hence hh has n+2n+2 distinct zeros. By Rolle’s theorem (see Differential and Integral Calculus 1), h(n+1)h^{(n+1)} will have at least one zero. Let’s call this point ξ\xi.
h(n+1)(x)=f(n+1)p(n+1)(x)undefined=0cw(n+1)(x)=f(n+1)(x)c(n+1)!h^{(n+1)}(x)=f^{(n+1)}-\underbrace{p^{(n+1)}(x)}_{=0}-cw^{(n+1)}(x)=f^{(n+1)}(x)-c(n+1)!
Hence
h(n+1)(ξ)=f(n+1)(ξ)c(n+1)!=0h^{(n+1)}(\xi)=f^{(n+1)}(\xi)-c(n+1)!=0
and thus
c=f(n+1)(ξ)(n+1)!.c=\frac{f^{(n+1)}(\xi)}{(n+1)!}.

We have proved that:

Theorem. If fCn+1f\in C^{n+1}, the residual R=fpR=f-p at xx has the form
R(x)=f(n+1)(ξ)(n+1)!j=0n(xxj).R(x)=\frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{j=0}^n(x-x_j).

Notice that, in general, RR is not a polynomial, as ξ=ξ(x)\xi=\xi(x) depends nonlinearly on xx.

Note. The constant cc is a divided difference:
f[x0,x1,,xn,x]=1(n+1)!f(n+1)(ξ(x)),f[x_0,x_1,\ldots,x_n,x]=\frac{1}{(n+1)!}f^{(n+1)}(\xi(x)),
which follows from the formula for RR, R(n+1)=f(n+1)R^{(n+1)}=f^{(n+1)} and R(xi)=f(xi)R(x_i)=f(x_i) for i=0,1,,ni=0,1,\ldots, n.

Piecewise interpolation

Setup. Fix a bounded interval [a,b][a,b] and a step size / mesh
h:=banh:=\frac{b-a}{n}
for some nNn\in\mathbb{N}, where nn is the number of subintervals.

Idea. Approximate the function on each subinterval using some low order interpolation polynomial such that the interpolation function is exact at the nodes.

Piecewise linear case:
i(x)=f(xi1)xxixi1xi+f(xi)xxi1xixi1,x[xi1,xi].\ell_i(x)=f(x_{i-1})\frac{x-x_i}{x_{i-1}-x_i}+f(x_i)\frac{x-x_{i-1}}{x_i-x_{i-1}},\quad x\in [x_{i-1},x_i].

By the residual formula on each subinterval R(x)=f(n+1)(ξ)(n+1)!j=0n(xxj)R(x)=\frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{j=0}^n(x-x_j), we get the interpolation error
f(x)i(x)=f(ξ)2!(xxi1)(xxi),f(x)-\ell_i(x)=\frac{f''(\xi)}{2!}(x-x_{i-1})(x-x_i),
which simplifies if f(x)M|f''(x)|\le M by maximization as follows
f(x)i(x)Mh28,x[xi1,xi].|f(x)-\ell_i(x)|\le M\frac{h^2}{8},\quad x\in [x_{i-1},x_i].

Note. If f'' is bounded over the whole interval [a,b], then the error bound above holds uniformly over the whole interval.

Hermite interpolation

Piecewise interpolation by degree 33 polynomials p3(x)=j=03cjxjp_3(x)=\sum_{j=0}^3 c_j x^j. As we have 44 coefficients, we need 44 constraints. We demand that not only the function but also the derivatives are exact at the nodes. Let pp be a cubic interpolation polynomial on [xi1,xi][x_{i-1},x_i]. Then pp' is a quadratic polynomial. Recall that h=xixi1h=x_i-x_{i-1}.
We have the conditions:

  1. p(xi1)=f(xi1)p(x_{i-1})=f(x_{i-1}),
  2. p(xi)=f(xi)p(x_{i})=f(x_{i}),
  3. p(xi1)=f(xi1)p'(x_{i-1})=f'(x_{i-1}),
  4. p(xi)=f(xi)p'(x_{i})=f'(x_{i}).

Set
p(x)=f(xi1)xxixi1xi+f(xi)xxi1xixi1+α(xxi1)(xxi).p'(x)=f'(x_{i-1})\frac{x-x_i}{x_{i-1}-x_i}+f'(x_i)\frac{x-x_{i-1}}{x_i-x_{i-1}}+\alpha(x-x_{i-1})(x-x_i).
Integrating yields
p(x)=f(xi1)hxi1x(txi)dt+f(xi)hxi1x(txi1)dt+αxi1x(txi1)(txi)dt+C.p(x)=-\frac{f'(x_{i-1})}{h}\int_{x_{i-1}}^x (t-x_i)\,dt+\frac{f'(x_i)}{h}\int_{x_{i-1}}^x (t-x_{i-1})\,dt+\alpha\int_{x_{i-1}}^x (t-x_{i-1})(t-x_i)\,dt+C.
Plugging in xi1x_{i-1} for xx yields that C=f(xi1)C=f(x_{i-1}). Plugging in xix_i for xx and integrating yields
\alpha=\frac{3}{h^2}(f'(x_{i-1})+f'(x_i))+\frac{6}{h^3}(f(x_{i-1})-f(x_i)).

Splines

Let us construct a global piecewise interpolation function sC2s\in C^2 such that:

  1. We do not impose exactness for derivatives.
  2. We get a piecewise polynomial construction of cubic interpolation polynomials which is exact and has continuous 1st and 2nd derivatives.

This requires a global setup. All coefficients are defined first, only evaluation is piecewise.
Setup. Let h=xixi1h=x_i-x_{i-1} be constant. Define
zi:=s(xi),i=1,,n1.z_i:=s''(x_i),\quad i=1,\ldots,n-1.
Now,
s(x)=1hzi1(xix)+1hzi(xxi1).s''(x)=\frac{1}{h}z_{i-1}(x_i-x)+\frac{1}{h}z_i(x-x_{i-1}).
Denote ss on the interval [xi1,xi][x_{i-1},x_i] by sis_i. Integrating twice yields
si(x)=1hzi1(xix)36+1hzi(xxi1)36+Ci(xxi1)+Di,s_i(x)=\frac{1}{h}z_{i-1}\frac{(x_i-x)^3}{6}+\frac{1}{h}z_i\frac{(x-x_{i-1})^3}{6}+C_i(x-x_{i-1})+D_i,
where
si(x)=1hzi1(xix)22+1hzi(xxi1)22+Ci.s_i'(x)=-\frac{1}{h}z_{i-1}\frac{(x_i-x)^2}{2}+\frac{1}{h}z_i\frac{(x-x_{i-1})^2}{2}+C_i.
Set fi:=f(xi)f_i:=f(x_i). We get that
Di=fi1h26zi1D_i=f_{i-1}-\frac{h^2}{6}z_{i-1}
and
Ci=1h[fifi1+h26(zi1zi)].C_i=\frac{1}{h}\left[f_i-f_{i-1}+\frac{h^2}{6}(z_{i-1}-z_i)\right].
Now ss has been defined over all subintervals. However, the ziz_i are still unknown!
Using the condition for continuity of the derivatives si(xi)=si+1(xi)s_i'(x_i)=s_{i+1}'(x_i) for all ii yields
h2zi+1h[(fifi1)+h26(zi1zi)]=h2zi+1h[(fi+1fi)+h26(zizi+1)],\frac{h}{2}z_i+\frac{1}{h}\left[(f_i-f_{i-1})+\frac{h^2}{6}(z_{i-1}-z_i)\right]=-\frac{h}{2}z_i+\frac{1}{h}\left[(f_{i+1}-f_i)+\frac{h^2}{6}(z_i-z_{i+1})\right],
for i=1,,n1i=1,\ldots,n-1.
In fact, this constitutes a tridiagonal system:
2h3zi+h6zi1+h6zi+1=1h(fi+12fi+fi1)=:bi.\frac{2h}{3}z_i+\frac{h}{6}z_{i-1}+\frac{h}{6}z_{i+1}=\frac{1}{h}(f_{i+1}-2f_i+f_{i-1})=:b_i.
The values z0z_0 and znz_n at the interval boundary have to be moved to the right hand side, and thus:
b1:=1h(f22f1+f0)h6z0b_1:=\frac{1}{h}(f_2-2f_1+f_0)-\frac{h}{6}z_0
and
bn1:=1h(fn2fn1+fn2)h6zn.b_{n-1}:=\frac{1}{h}(f_n-2f_{n-1}+f_{n-2})-\frac{h}{6}z_n.
z0z_0 and znz_n can be chosen freely, for example to force that the 1st derivative of the spline is exact at the interval boundary points. If z0=zn=0z_0=z_n=0, ss is called a natural spline.
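A minimal Python/NumPy sketch assembling and solving the tridiagonal system above for a natural spline (z_0=z_n=0) on an equispaced grid; the function name and the test data are illustrative:

```python
import numpy as np

def natural_spline_second_derivatives(x, f):
    """Solve for z_i = s''(x_i), i=1,...,n-1, with z_0 = z_n = 0."""
    n = len(x) - 1
    h = x[1] - x[0]
    A = np.zeros((n - 1, n - 1))
    b = np.zeros(n - 1)
    for i in range(1, n):
        A[i - 1, i - 1] = 2 * h / 3
        if i > 1:
            A[i - 1, i - 2] = h / 6
        if i < n - 1:
            A[i - 1, i] = h / 6
        b[i - 1] = (f[i + 1] - 2 * f[i] + f[i - 1]) / h
    z = np.zeros(n + 1)
    z[1:n] = np.linalg.solve(A, b)   # a real solver would exploit the tridiagonal structure
    return z

x = np.linspace(0.0, np.pi, 6)
print(natural_spline_second_derivatives(x, np.sin(x)))   # roughly -sin(x_i)
```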

Bézier curves

Bézier curves are parametrized curves in R2\mathbb{R}^2, that is, r(t)=x(t)i+y(t)j\mathbf{r}(t)=x(t)\mathbf{i}+y(t)\mathbf{j},
where i:=(10)\mathbf{i}:=\begin{pmatrix}1\\0\end{pmatrix}, j:=(01)\mathbf{j}:=\begin{pmatrix}0\\1\end{pmatrix} and x,y:[0,1]Rx,y:[0,1]\to\mathbb{R}.

Bernstein polynomials

Define the Bernstein polynomial Bkn(t)B_k^n(t), t[0,1]t\in [0,1], nN{0}n\in\mathbb{N}\cup\{0\}, k=0,,nk=0,\ldots,n, by
Bkn(t):=(nk)tk(1t)nk.B_k^n(t):={n\choose k}t^k(1-t)^{n-k}.
Properties:

  1. k=0nBkn(t)=1\sum_{k=0}^n B_k^n(t)=1 (=(t+1t)n=(t+1-t)^n),
  2. 0Bkn(t)10\le B_k^n(t)\le 1,
  3. B0n(0)=Bnn(1)=1B_0^n(0)=B_n^n(1)=1, otherwise, if k0k\not= 0, Bkn(0)=0B_k^n(0)=0 and if knk\not=n, Bkn(1)=0B_k^n(1)=0.

We have the combinatorial rule:
Bkn(t)=(1t)Bkn1(t)+tBk1n1(t).B_k^n(t)=(1-t)B_k^{n-1}(t)+tB_{k-1}^{n-1}(t).

Bézier curves

Fix a finite set X=\{x_0,x_1,\ldots,x_k\} of control points x_i\in\mathbb{R}^n.

Definition. The convex hull of XX is defined by
\operatorname{conv}(X)=\left\{y\in\mathbb{R}^n\;\colon\; y=\sum_{i=0}^k \lambda_i x_i,\; \lambda_i\in [0,1],\;\sum_{i=0}^k\lambda_i=1\right\}.

Definition. The Bézier curve \beta^n is defined by
βn(t)=k=0nxkBkn(t).\beta^n(t)=\sum_{k=0}^n x_k B_k^n(t).
Sanity check: t=0t=0, Bkn(0)=0B_k^n(0)=0, except B0n(0)=1B_0^n(0)=1.
\Rightarrow βn(0)=x0\beta^n(0)=x_0,
\Rightarrow βn(1)=xn\beta^n(1)=x_n.
We get closed curves if x0=xnx_0=x_n.

What about the continuous tangents?
Recall that (nk)=n!k!(nk)!{n\choose k}=\frac{n!}{k!(n-k)!}.
ddtBkn(t)=(nk)(ktk1(1t)nk(nk)tk(1t)nk1)\frac{d}{dt}B_k^n(t)={n\choose k}\left(k t^{k-1}(1-t)^{n-k}-(n-k)t^k(1-t)^{n-k-1}\right)
=n[(n1)!(k1)!(nk)!tk1(1t)nk(n1)!(k)!(nk1)!tk(1t)nk1]=n\left[\frac{(n-1)!}{(k-1)!(n-k)!}t^{k-1}(1-t)^{n-k}-\frac{(n-1)!}{(k)!(n-k-1)!}t^{k}(1-t)^{n-k-1}\right]
=n(Bk1n1(t)Bkn1(t)).=n\left(B_{k-1}^{n-1}(t)-B_k^{n-1}(t)\right).
Therefore,
ddtβn(t)=nk=0n(Bk1n1(t)Bkn1(t))xk\frac{d}{dt}\beta^n(t)=n\sum_{k=0}^n\left(B_{k-1}^{n-1}(t)-B_k^{n-1}(t)\right) x_k
=n[k=1nBk1n1(t)xkk=0n1Bkn1(t)xk]=n\left[\sum_{k=1}^{n}B_{k-1}^{n-1}(t)x_k-\sum_{k=0}^{n-1}B_k^{n-1}(t)x_k\right]
=n[k=0n1Bkn1(t)xk+1k=0n1Bkn1(t)xk]=n\left[\sum_{k=0}^{n-1}B_{k}^{n-1}(t)x_{k+1}-\sum_{k=0}^{n-1}B_k^{n-1}(t)x_k\right]
=nk=0n1(xk+1xk)Bkn1(t)undefinedBeˊzier.=\underbrace{n\sum_{k=0}^{n-1}(x_{k+1}-x_k)B_k^{n-1}(t)}_{\text{Bézier}}.

Hence, at the endpoints:
{ddtβn(0)=n(x1x0),ddtβn(1)=n(xnxn1).\begin{cases}\frac{d}{dt}\beta^n(0)=n(x_1-x_0),\\ \frac{d}{dt}\beta^n(1)=n(x_n-x_{n-1}).\end{cases}
For a closed curve (x_0=x_n) to have a continuous tangent at the joining point, we need that (x_1-x_0)\parallel (x_n-x_{n-1}), i.e., x_1-x_0 and x_n-x_{n-1} are parallel.

Lifting

Control points define the curve but the converse is not true.
Consider:
βn(t)=k=0nBkn(t)xk=k=0n+1Bkn+1(t)yk=αn+1(t).\beta^n(t)=\sum_{k=0}^n B_k^n(t) x_k=\sum_{k=0}^{n+1} B_k^{n+1}(t)y_k=\alpha^{n+1}(t).
Let us use the convention x1=xn+1=0x_{-1}=x_{n+1}=0. We get the condition
yk=(1kn+1)xk+kn+1xk1.y_k=\left(1-\frac{k}{n+1}\right)x_k+\frac{k}{n+1}x_{k-1}.

De Casteljau’s algorithm

For control points x0,x1,,xnx_0,x_1,\ldots,x_n the algorithm of De Casteljau is as follows:

  1. Define constant curves βi0(t)=xi\beta_i^0(t)=x_i.
  2. Set \beta_i^r(t)=(1-t)\beta_{i}^{r-1}(t)+t\beta_{i+1}^{r-1}(t),\quad r=1,\ldots, n,\quad i=0,\ldots,n-r.

The algorithm terminates at \beta_0^n(t) and needs {n+1\choose 2} convex-combination operations.

There is also a reverse algorithm for splitting Bézier curves.
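A minimal Python sketch of De Casteljau's algorithm for control points in the plane (the control polygon is chosen only for illustration):

```python
def de_casteljau(points, t):
    """Evaluate the Bézier curve with the given control points at t in [0, 1]."""
    beta = list(points)                    # beta_i^0 = x_i
    n = len(beta) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):         # convex combinations of neighbours
            beta[i] = tuple((1 - t) * a + t * b for a, b in zip(beta[i], beta[i + 1]))
    return beta[0]

control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
print(de_casteljau(control, 0.0))   # first control point x_0
print(de_casteljau(control, 0.5))   # a point on the curve
print(de_casteljau(control, 1.0))   # last control point x_n
```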

Numerical integration

Integration schemes are called quadratures. Therefore, numerical integration methods are simply called numerical quadratures.

Note. There are no simple integration schemes in higher dimensions. Already 2D-cases are complicated.

Newton-Cotes quadrature rules

Let f:[a,b]Rf:[a,b]\to\mathbb{R}.
Idea. Approximate
abf(x)dx=:I\int_a^b f(x)\,dx=:I by the integral of an interpolation polynomial
Iabpk(x)dx=:Q(pk),I\approx\int_a^b p_k(x)\,dx=:Q(p_k),
where pkp_k is an interpolant of ff over [a,b][a,b].

Lagrange:
\int_a^b f(x)\,dx\approx \sum_{i=0}^n f(x_i)\int_a^b\left(\prod_{\substack{j=0\\j\not=i}}^n\frac{x-x_j}{x_i-x_j}\right)\,dx.

Let n=1n=1:
p1(x)=f(a)xbab+f(b)xaba,p_1(x)=f(a)\frac{x-b}{a-b}+f(b)\frac{x-a}{b-a},
so
abf(x)dxabp1(x)dx=ba2[f(a)+f(b)].\int_a^b f(x)\,dx\approx \int_a^b p_1(x)\,dx=\frac{b-a}{2}[f(a)+f(b)].
\Rightarrow Trapezoidal rule!
Error formulation:
abf(x)dxabp1(x)dx=12abf(ξ)(xa)(xb)dx.\int_a^b f(x)\,dx-\int_a^b p_1(x)\,dx=\frac{1}{2}\int_a^b f''(\xi)(x-a)(x-b)\,dx.
Now, (xa)(xb)<0(x-a)(x-b)<0 for x(a,b)x\in (a,b).
Therefore, by the mean value theorem of integration,
=12f(η)ab(xa)(xb)dx.=\frac{1}{2}f''(\eta)\int_a^b (x-a)(x-b)\,dx.
=112(ba)3f(η).=-\frac{1}{12}(b-a)^3 f''(\eta).

Composite rule: h=banh=\frac{b-a}{n}, xi=a+ihx_i=a+ih, i=0,,ni=0,\ldots,n.
abf(x)dxh2[f(x0)+2f(x1)++2f(xn1)+f(xn)].\int_a^b f(x)\,dx\approx \frac{h}{2}\left[f(x_0)+2f(x_1)+\ldots+2f(x_{n-1})+f(x_n)\right].
Total error: O(h2)O(1n2)O(h^2)\sim O(\frac{1}{n^2}). We say that the method is quadratic.

Let n=2n=2. When is a method exact for degree 22 (or lower)?

Note. In this context, exactness means, that the integral and the method give the exact same result for
polynomials of certain order.

abf(x)dx=A1f(a)+A2f(a+b2)+A3f(b),\int_a^b f(x)\,dx=A_1 f(a)+A_2 f\left(\frac{a+b}{2}\right)+A_3 f(b),
where we call the AiA_i weights.
ab1dx=baA1+A2+A3=ba.\int_a^b 1\,dx=b-a\quad\Rightarrow A_1+A_2+A_3=b-a.
abxdx=b2a22A1a+A2(a+b2)+A3b=b2a22.\int_a^b x\,dx=\frac{b^2-a^2}{2}\quad\Rightarrow A_1 a+A_2\left(\frac{a+b}{2}\right)+A_3 b=\frac{b^2-a^2}{2}.
abx2dx=13(b3a3)A1a2+A2(a+b2)2+A3b2=13(b3a3).\int_a^b x^2\,dx=\frac{1}{3}(b^3-a^3)\quad\Rightarrow A_1 a^2+A_2 \left(\frac{a+b}{2}\right)^2+A_3 b^2=\frac{1}{3}(b^3-a^3).
Thus,
A1=A3=ba6A_1=A_3=\frac{b-a}{6}
and
A2=4(ba)6.A_2=\frac{4(b-a)}{6}.
As integrals and the methods are linear, this extends to all polynomials of deg2\deg\le 2.

This is the so-called Simpson’s rule:
abf(x)dxba6[f(a)+4f(a+b2)+f(b)].\int_a^b f(x)\,dx\approx\frac{b-a}{6}\left[f(a)+4 f\left(\frac{a+b}{2}\right)+f(b)\right].

The associated composite rule (for an even number n of subintervals, with x_i=a+ih, h=\frac{b-a}{n}) becomes:
\int_a^b f(x)\,dx\approx \frac{h}{3}\left[f(x_0)+4f(x_1)+2f(x_2)+4f(x_3)+\ldots+4f(x_{n-1})+f(x_n)\right].
Error for n=2n=2:
14!5!(ba)5f(4)(η).\frac{1}{4!5!}(b-a)^5 f^{(4)}(\eta).
Error for the composite: O(h4)O(h^4). The method is exact for cubic polynomials!
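A minimal Python sketch of the composite trapezoidal and Simpson rules with node spacing h=\frac{b-a}{n}, matching the composite formulas above; the test integral \int_0^\pi \sin x\,dx=2 is chosen only for illustration:

```python
import math

def composite_trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def composite_simpson(f, a, b, n):
    """n must be even; nodes x_i = a + i*h with h = (b - a)/n."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * s

exact = 2.0
for n in (4, 8, 16):
    print(n,
          abs(composite_trapezoid(math.sin, 0.0, math.pi, n) - exact),   # O(h^2)
          abs(composite_simpson(math.sin, 0.0, math.pi, n) - exact))     # O(h^4)
```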

Orthogonal polynomials

Define the inner product of two real-valued polynomials on [a,b][a,b] (depends on aa and bb!) by:
p,q=abp(x)q(x)dx.\langle p,q\rangle=\int_a^b p(x)q(x)\,dx.
The associated norm on [a,b][a,b] is given by
q:=(abq(x)2dx)1/2.\|q\|:=\left(\int_a^b |q(x)|^2\,dx\right)^{1/2}.
Definition. Two non-zero polynomials are said to be orthogonal on [a,b][a,b] if their inner product is zero. They are said to be orthonormal if they are orthogonal and have both norm 11.
In other words, orthogonality: p,q=0\langle p,q\rangle =0, then we write pqp\perp q.
Orthonormality: pqp\perp q and p,p=1=q,q\langle p,p\rangle =1=\langle q,q\rangle.

Gram-Schmidt (GS) procedure.
Idea. Transform a basis to an orthogonal one:
{1,x,x2,,xk,}{q0,q1,,qk,},orthonormal.\{1,x,x^2,\ldots,x^k,\ldots\}\longrightarrow \{q_0,q_1,\ldots,q_k,\ldots\},\quad\text{orthonormal}.

Note. The GS procedure depends on the inner product, and thus, here, on aa and bb.

The elements of the orthonormal basis are called orthogonal polynomials.

  1. q0=11=1(ab12dx)1/2=1ba.q_0=\frac{1}{\|1\|}=\frac{1}{\left(\int_a^b 1^2\,dx\right)^{1/2}}=\frac{1}{\sqrt{b-a}}.
  2. For j=1,2,j=1,2,\ldots, q~j(x)=xqj1(x)i=0j1xqj1,qiqi(x),\tilde{q}_j(x)=xq_{j-1}(x)-\sum_{i=0}^{j-1}\langle x q_{j-1},q_i\rangle q_i(x), and qj(x):=q~jq~j.q_j(x):=\frac{\tilde{q}_j}{\|\tilde{q}_j\|}.

The new basis is (pairwise) orthonormal!
Above, as usually, we denote the polynomial p(x)=xp(x)=x with the symbol xx.

Observation. By bilinearity, qj1q_{j-1} is orthogonal to all polynomials of degj2\deg\le j-2.
Thus,
xqj1,qi=qj1,xqi=0,ij3.\langle x q_{j-1},q_i\rangle=\langle q_{j-1},xq_i\rangle=0,\quad i\le j-3.
As a consequence, the GS procedure reduces to
q~j(x)=xqj1(x)xqj1,qj1qj1(x)xqj1,qj2qj2(x)\tilde{q}_{j}(x)=xq_{j-1}(x)-\langle x q_{j-1},q_{j-1}\rangle q_{j-1}(x)-\langle x q_{j-1},q_{j-2}\rangle q_{j-2}(x)
which is a three-term recurrence rule!

Note. The trick xqj1,qi=qj1,xqi\langle x q_{j-1},q_i\rangle=\langle q_{j-1},xq_i\rangle relies heavily on the fact that the inner product is defined by an integral and that we are dealing with polynomials. The GS procedure works generally in pre-Hilbert spaces, however, then we do not expect this kind of simplification.

Claim. The GS procedure works.
Proof.
q~j,qj1=xqj1,qj1i=0j1xqj1,qiqi,qj1undefined=0  except when  i=j1, then it is  =1\langle \tilde{q}_j,q_{j-1}\rangle=\langle x q_{j-1},q_{j-1}\rangle -\sum_{i=0}^{j-1}\langle x q_{j-1},q_i\rangle\underbrace{\langle q_i,q_{j-1}\rangle}_{=0\;\text{except when}\;i=j-1\text{, then it is}\;=1}
=xqj1,qj1xqj1,qj1=0.=\langle xq_{j-1},q_{j-1}\rangle-\langle xq_{j-1},q_{j-1}\rangle=0.
\Box

Gauss quadrature

Idea. Choose the nodes and the weights simultaneously.
One interval:
\int_a^b f(x)\,dx\approx A_0 f(x_0)+A_1 f(x_1),
with weights A_0, A_1 and nodes x_0, x_1; for n=1, this is an (n+1)=2-point rule.
The coefficients are determined by the usual process:
ab1dx=ba=A0+A1.\int_a^b 1\,dx=b-a=A_0+A_1.
abxdx=b2a22=A0x0+A1x1.\int_a^b x\,dx=\frac{b^2-a^2}{2}= A_0 x_0+A_1 x_1.
abx2dx=13(b3a3)=A0x02+A1x12.\int_a^b x^2\,dx=\frac{1}{3}(b^3-a^3)=A_0 x_0^2+A_1 x_1^2.
The resulting system is nonlinear!

Let us use the orthogonal polynomials in the following way.

Theorem. Let x_0,x_1,\ldots,x_n be the roots of an orthogonal polynomial q_{n+1} of degree n+1 on [a,b].
Then
abf(x)dxi=0nAif(xi),\int_a^b f(x)\,dx\approx\sum_{i=0}^n A_i f(x_i),
where
Ai:=abφi(x)dx,φi(x)=j=0jinxxjxixj,A_i:=\int_a^b\varphi_i(x)\,dx,\quad\varphi_i(x)=\prod_{\substack{j=0\\j\not=i}}^n\frac{x-x_j}{x_i-x_j},
is exact for all polynomials of degree 2n+12n+1 or less.

Proof. Let f be a polynomial with \deg f\le 2n+1. By the polynomial division algorithm,
f=qn+1pn+rn,f=q_{n+1} p_n+r_n,
where degpnn\deg p_n\le n and degrnn\deg r_n\le n. Then,
f(xi)=qn+1(xi)undefined=0pn(xi)+rn(xi)=rn(xi).f(x_i)=\underbrace{q_{n+1}(x_i)}_{=0}p_n(x_i)+r_n(x_i)=r_n(x_i).
Integrate,
abf(x)dx=abqn+1(x)pn(x)dxundefined=qn+1,pn=0+abrn(x)dx\int_a^b f(x)\,dx=\underbrace{\int_a^b q_{n+1}(x)p_n(x)\,dx}_{=\langle q_{n+1},p_n\rangle=0}+\int_a^b r_n(x)\,dx
=abrn(x)dx=i=0nAirn(xi)=i=0nAif(xi).=\int_a^b r_n(x)\,dx=\sum_{i=0}^n A_i r_n(x_i)=\sum_{i=0}^n A_i f(x_i).
The second-to-last equality holds because r_n, having degree at most n, is integrated exactly by the interpolatory rule with n+1 nodes; the last equality follows from f(x_i)=r_n(x_i), shown above. \Box

We can extend the notion of orthogonal polynomials to so-called weighted orthogonal polynomials with respect to the inner product
p,qw=abp(x)q(x)w(x)dx,\langle p,q\rangle_w=\int_a^b p(x)q(x)w(x)\,dx,
where ww is a positive weight function.

One (mathematical) advantage: Works also on R=(,)\mathbb{R}=(-\infty,\infty).

Example. If w(x)=exw(x)=e^{-x}, we get the so-called Laguerre polynomials. If w(x)=ex22w(x)=e^{-\frac{x^2}{2}}, we get the so-called Hermite polynomials, which are meaningful in probability theory (the weight is the density of the Gaussian normal distribution up to multiplication by a normalization constant).

Theorem. The previous theorem also holds for a \langle\cdot,\cdot\rangle_w-orthogonal polynomial q_{n+1}, now with
Ai:=abφi(x)w(x)dx,φi(x)=j=0jinxxjxixj.A_i:=\int_a^b\varphi_i(x)w(x)\,dx,\quad\varphi_i(x)=\prod_{\substack{j=0\\j\not=i}}^n\frac{x-x_j}{x_i-x_j}.

Error formula. (n+1)(n+1)-point rule with nodes x0,x1,,xnx_0,x_1,\ldots, x_n:
\text{error}=\frac{f^{(2(n+1))}(\xi(x))}{(2(n+1))!}\prod_{j=0}^n (x-x_j)^2.
Where does the square come from?
We assume that the derivatives of ff are continuous, therefore Hermite interpolation is the natural choice.

Example. Gauss rule on [1,1][-1,1], n=1n=1. Notice, since we only want the roots, there is no need to normalize q~i\tilde{q}_i, i=0,1,2i=0,1,2.
GS: q~0=1\tilde{q}_0=1.
q~1=x1x,11,11=x11xdx111dx1=x,\tilde{q}_1=x\cdot 1-\frac{\langle x,1\rangle}{\langle 1,1\rangle}\cdot 1=x-\frac{\int_{-1}^1 x\,dx}{\int_{-1}^1 1\,dx}\cdot 1=x,
where 1,1=2\langle 1,1\rangle=2, and x,1=0\langle x,1\rangle=0.
q~2=xxx2,11,11x2,xx,xx=x213,\tilde{q}_2=x\cdot x-\frac{\langle x^2,1\rangle}{\langle 1,1\rangle}\cdot 1-\frac{\langle x^2,x\rangle}{\langle x,x\rangle}\cdot x=x^2-\frac{1}{3},
where x,x=x2,1=23\langle x,x\rangle=\langle x^2,1\rangle=\frac{2}{3} and x2,x=0\langle x^2,x\rangle=0.
The resulting orthogonal polynomials on [1,1][-1,1] are called Legendre polynomials.
q~2\tilde{q}_2 (and q2q_2) has the roots x=±13x=\pm\frac{1}{\sqrt{3}}.
The associated Gauss quadrature rule is:
\int_{-1}^1 f(x)\,dx\approx A_0 f\left(-\frac{1}{\sqrt{3}}\right)+A_1 f\left(\frac{1}{\sqrt{3}}\right).
Let us check exactness:
111dx=2=A0+A1.\int_{-1}^1 1\,dx=2=A_0+A_1.
11xdx=0=A03+A13.\int_{-1}^1 x\,dx=0= \frac{-A_0}{\sqrt{3}}+\frac{A_1}{\sqrt{3}}.
From this, we obtain easily that A_0=A_1=1. These weights could of course also have been determined by integrating the Lagrange polynomials over [-1,1].
\int_{-1}^1 x^2\,dx=\frac{2}{3}=1\cdot\left(-\frac{1}{\sqrt{3}}\right)^2+1\cdot\left(\frac{1}{\sqrt{3}}\right)^2.
\int_{-1}^1 x^3\,dx=0=1\cdot\left(-\frac{1}{\sqrt{3}}\right)^3+1\cdot\left(\frac{1}{\sqrt{3}}\right)^3.
Thus, the Gauss quadrature is indeed exact up to degree 2n+1=3!
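A minimal Python sketch of the resulting two-point Gauss–Legendre rule on [-1,1], checking the exactness up to degree 3 (names are illustrative):

```python
import math

def gauss_legendre_2(f):
    """Two-point Gauss-Legendre rule on [-1, 1]: nodes -1/sqrt(3) and 1/sqrt(3), weights 1."""
    x = 1.0 / math.sqrt(3.0)
    return f(-x) + f(x)

print(gauss_legendre_2(lambda t: t**3 + t**2))   # 2/3, exact (degree 3)
print(gauss_legendre_2(lambda t: t**4))          # 2/9, while the exact integral is 2/5
```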

Probabilistic examples

Monte Carlo integration

Let X_i, i\in\mathbb{N}, be i.i.d. (independent, identically distributed) random variables with mean \mu and variance \sigma^2. Consider the arithmetic mean (also called Cesàro mean)
A_N:=\frac{1}{N}\sum_{i=1}^N X_i.
By the law of large numbers, we have almost surely
limNAN=μ.\lim_{N\to\infty}A_N=\mu.
We have that
var(AN)=1N2i=1Nvar(Xi)=σ2N.\operatorname{var}(A_N)=\frac{1}{N^2}\sum_{i=1}^N\operatorname{var}(X_i)=\frac{\sigma^2}{N}.
In order to get the right unit, we have to consider the standard deviation
σ(AN)=σN.\sigma(A_N)=\frac{\sigma}{\sqrt{N}}.
Consequence:
If our integration problem can be cast into an averaging problem, the convergence rate will be O(1N)O(\frac{1}{\sqrt{N}}).

Note. The rate is independent of the spatial dimension.

Example. Estimating the value of \pi. The area of a circle is A=\pi r^2. Set r=1. Consider the box V=[-1,1]\times [-1,1] with volume |V|=4. The ratio of the area of the circle inscribed in the box to the area of the box is \frac{\pi}{4}. Let
g_i=\begin{cases}1,\;\;\text{if $p_i$ is inside the circle,}\\0,\;\;\text{otherwise.} \end{cases}
Idea: Let us sample the points pip_i uniformly from VV. In the limit, the number of “hits” over all samples tends to the ratio of the areas!
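A minimal Python sketch of this Monte Carlo estimate of \pi (sample sizes and names are illustrative):

```python
import random

def estimate_pi(n):
    """Sample n points uniformly in [-1,1]^2 and count hits inside the unit circle."""
    hits = sum(1 for _ in range(n)
               if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1.0)
    return 4.0 * hits / n

for n in (10**3, 10**5, 10**7):
    print(n, estimate_pi(n))   # the error decreases roughly like 1/sqrt(n)
```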

Buffon’s needle

Suppose we perform a random experiment with a large number of needles, all of the same length L, that we throw on the floor, on which parallel lines are drawn, each at the same distance D from its neighboring line.

What is the probability that a dropped needle intersects with a line?

Let y be the distance from the center of the needle to the closest line and let \theta be the acute angle between the needle and the lines.

We assume, for simplicity, L=D=1L=D=1.

Both yy and θ\theta are random variables with distribution
yUnif([0,12]),y\sim \operatorname{Unif}\left(\left[0,\frac{1}{2}\right]\right),
where y=0y=0 means that the needle is centered on a line and y=12y=\frac{1}{2} means that yy is perfectly centered between two lines.
θUnif([0,π2]),\theta\sim \operatorname{Unif}\left(\left[0,\frac{\pi}{2}\right]\right),
where θ=0\theta=0 means that the needle is parallel to the lines and θ=π2\theta=\frac{\pi}{2} means that the needle is perpendicular to the lines.

We may assume that yy and θ\theta are independent random variables (why?).

By trigonometry, the condition for intersection with a line is
2ysinθ.2y\le \sin\theta.
The joint probability distribution of the two independent random variables is the uniform (product) distribution on [0,\frac{\pi}{2}]\times [0,\frac{1}{2}]. Determining the probability of intersection requires computing the ratio of the area of the region where the intersection condition holds to the total area \frac{\pi}{4}.

The area of the region where the condition 2y\le\sin\theta holds is
\int_0^{\frac{\pi}{2}}\frac{\sin\theta}{2}\,d\theta=\frac{1}{2}.

Thus the probability of intersection is
12/π4=2π.\frac{1}{2}\bigg/\frac{\pi}{4}=\frac{2}{\pi}.

By the law of large numbers, the ratio of needles intersecting the lines with all needles converges to 2π0.6366\frac{2}{\pi}\approx 0.6366. Hence, we have found yet another probabilistic algorithm to determine the value of π\pi.

Initial value problems (IVPs)

Problem. (Not necessarily linear) ordinary differential equation (ODE), with initial value y0y_0 at initial time t0t_0 up to a finite time horizon T>t0T>t_0:
{y(t)=f(t,y(t)),y(t0)=y0.\begin{cases}y'(t)=f(t,y(t)),\\y(t_0)=y_0.\end{cases}

Assumptions. Existence and uniqueness of the solutions are understood (by e.g. Picard iteration).
Let us assume that ff is continuous as a function from [t0,T]×RR[t_0,T]\times\mathbb{R}\to\mathbb{R} and Lipschitz continuous in the following sense:
There exists L>0L>0 such that for every y,zRy,z\in\mathbb{R}, t[t0,T]t\in [t_0,T],
f(t,y)f(t,z)Lyz.|f(t,y)-f(t,z)|\le L|y-z|.

Euler’s method

Fix a constant step size h>0h>0.

  1. y0:=y(t0)y_0:=y(t_0).
  2. tk:=tk1+ht_{k}:=t_{k-1}+h and yk+1=yk+hf(tk,yk)y_{k+1}=y_k+h f(t_k,y_k), k=0,1,2,k=0,1,2\ldots,.

Applying Taylor’s formula yields:
y(tk+1)=y(tk)+hy(tk)+h22y(ξk)=y(tk)+hf(tk,y(tk))+h22y(ξk),y(t_{k+1})=y(t_k)+hy'(t_k)+\frac{h^2}{2}y''(\xi_k)=y(t_k)+hf(t_k,y(t_k))+\frac{h^2}{2}y''(\xi_k),
with \xi_k\in [t_k,t_{k+1}].

We shall deal with two types of error:

(A) truncation error (local),
(B) global error.

Notation. y(tk)y(t_k) denotes the exact solution to the IVP at t=tkt=t_k, whereas yky_k denotes the result of the method at step kk.

For Euler, we get that
yk+1ykh=f(tk,yk)+h2y(ξk)undefinedlocal error  O(h).\frac{y_{k+1}-y_k}{h}=f(t_k,y_k)+\underbrace{\frac{h}{2}y''(\xi_k)}_{\text{local error}\;O(h)}.
Hence the Euler method is locally (in each point) of order 1.
The method is consistent:
limh0yk+1ykh=y(tk)=f(tk,y(tk)).\lim_{h\to 0}\frac{y_{k+1}-y_k}{h}=y'(t_k)=f(t_k,y(t_k)).

Note that kk depends on hh, which we omit in the notation.

What about the global error, that is, uniform convergence on [t0,T][t_0,T]?
maxy(tk)yk0\max |y(t_k)-y_k|\to 0 as h0h\to 0?

Theorem. Assume the following:

  1. ff is Lipschitz in the second component.
  2. \max_{t\in[t_0,T]}|y''(t)|\le M for some global constant M>0.
  3. y0y(t0)y_0\to y(t_0) as h0h\to 0.

Then Euler’s method is uniformly convergent to the exact solution on [t0,T][t_0,T] and the global error is O(h)O(h).

Proof. Set dj:=y(tj)yjd_j:=y(t_j)-y_j.
By Taylor and Euler:
dk+1=dk+h[f(tk,y(tk))f(tk,yk)]+h22y(ξk).d_{k+1}=d_k+h[f(t_k,y(t_k))-f(t_k,y_k)]+\frac{h^2}{2}y''(\xi_k).
We get that
dk+1dk+hLdk+h22M=(1+hL)dk+h22M.|d_{k+1}|\le |d_k|+hL|d_k|+\frac{h^2}{2}M=(1+hL)|d_k|+\frac{h^2}{2}M.

We shall need a lemma on recursive inequalities.

Lemma. If for α,β>0\alpha,\beta>0,
γk+1(1+α)γk+β,\gamma_{k+1}\le(1+\alpha)\gamma_k+\beta,
then
γnenαγ0+enα1αβ.\gamma_n\le e^{n\alpha}\gamma_0+\frac{e^{n\alpha}-1}{\alpha}\beta.

Proof. Iterating the inequality yields
γn(1+α)2γn2+[(1+α)+1]β(1+α)nγ0+[j=0n1(1+α)j]β.\gamma_n\le (1+\alpha)^2\gamma_{n-2}+[(1+\alpha)+1]\beta\le(1+\alpha)^n\gamma_0+\left[\sum_{j=0}^{n-1}(1+\alpha)^j\right]\beta.
Note that by the Taylor formula,
eα=e0+e0α+α22eξ=1+α+α22eξe^\alpha=e^0+e^0\alpha+\frac{\alpha^2}{2}e^\xi=1+\alpha+\frac{\alpha^2}{2}e^\xi
with ξ[0,α].\xi\in [0,\alpha].
Hence
1+α1+α+α22eξundefined>0=eα.1+\alpha\le 1+\alpha+\underbrace{\frac{\alpha^2}{2}e^\xi}_{>0}=e^\alpha.
And thus,
\gamma_n\le e^{n\alpha}\gamma_0+\frac{(1+\alpha)^n-1}{\alpha}\beta\le e^{n\alpha}\gamma_0+\frac{e^{n\alpha}-1}{\alpha}\beta.
\Box

Returning to the proof of the theorem, we get that
dkekhLd0+ekhL1Lhh22M.|d_k|\le e^{khL}|d_0|+\frac{e^{khL}-1}{Lh}\frac{h^2}{2}M.
Now for khTt0kh\le T-t_0,
maxkdkeL(Tt0)d0+eL(Tt0)1Lh2M.\max_{k}|d_k|\le e^{L(T-t_0)}|d_0|+\frac{e^{L(T-t_0)}-1}{L}\frac{h}{2}M.
d00|d_0|\to 0 as h0h\to 0 by the 3rd assumption. Hence, the method converges uniformly with linear convergence rate. \Box
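
The linear rate can also be observed numerically; a minimal sketch on the illustrative test problem y'=-y, y(0)=1 (halving h should roughly halve the maximal error):

```python
from math import exp

# Empirical check of the O(h) global error of Euler's method on [0, 1].
for h in [0.1, 0.05, 0.025, 0.0125]:
    t, y, err = 0.0, 1.0, 0.0
    for _ in range(round(1.0 / h)):
        y += h * (-y)                       # Euler step for f(t, y) = -y
        t += h
        err = max(err, abs(y - exp(-t)))    # global error max_k |y(t_k) - y_k|
    print(f"h = {h:7.4f}   max error = {err:.2e}")
# The reported errors roughly halve when h is halved, i.e. the global error is O(h).
```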

Heun’s method

Idea. Predictor-corrector.

y~k+1=yk+hf(tk,yk)(prediction)\tilde{y}_{k+1}=y_k+h f(t_k,y_k)\quad\text{(prediction)}
yk+1=yk+h2[f(tk,yk)+f(tk+1,y~k+1)](correction)y_{k+1}=y_k+\frac{h}{2}[f(t_k,y_k)+f(t_{k+1},\tilde{y}_{k+1})]\quad\text{(correction)}
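
A sketch of one step in Python (names and the test problem are illustrative):

```python
from math import exp

def heun_step(f, t, y, h):
    """One predictor-corrector step of Heun's method."""
    y_pred = y + h * f(t, y)                            # Euler predictor
    return y + 0.5 * h * (f(t, y) + f(t + h, y_pred))   # trapezoidal corrector

# Test problem y' = -y, y(0) = 1 on [0, 1].
f = lambda t, y: -y
t, y, h = 0.0, 1.0, 0.1
for _ in range(10):
    y = heun_step(f, t, y, h)
    t += h
print(abs(y - exp(-1.0)))   # noticeably smaller error than Euler with the same h
```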

Explicit vs. Implicit

Quadrature. Integral formulation of the IVP:
y(t+h)=y(t)+tt+hf(s,y(s))ds,y(t+h)=y(t)+\int_t^{t+h}f(s,y(s))\,ds,
apply your favorite quadrature rule, for instance:
h2[f(t,y(t))+f(t+h,y(t+h))]+O(h3).\frac{h}{2}[f(t,y(t))+f(t+h,y(t+h))]+O(h^3).
Combined, we get:
yk+1=yk+h2[f(tk,yk)+f(tk+1,yk+1)].y_{k+1}= y_k+\frac{h}{2}[f(t_k,y_k)+f(t_{k+1},y_{k+1})].
This method is implicit. Every step requires a solution of a nonlinear (fixed point) problem.
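
For instance, one step of the implicit trapezoidal method can be realized by a simple fixed-point iteration (a sketch; the iteration converges if hL/2<1, and for stiff problems one would rather use Newton's method):

```python
def trapezoidal_step(f, t, y, h, n_iter=20):
    """One implicit trapezoidal step: solve y_next = y + h/2 (f(t, y) + f(t+h, y_next))
    by fixed-point iteration, started from the explicit Euler predictor."""
    y_next = y + h * f(t, y)
    for _ in range(n_iter):
        y_next = y + 0.5 * h * (f(t, y) + f(t + h, y_next))
    return y_next

# Example: one step of size h = 0.1 for y' = -y, y(0) = 1.
print(trapezoidal_step(lambda t, y: -y, 0.0, 1.0, 0.1))
```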

Heun’s method and Euler’s method are explicit.

Linear multistep methods

y(tk+1)=y(tk)+tktk+1f(s,y(s))ds.y(t_{k+1})=y(t_k)+\int_{t_k}^{t_{k+1}}f(s,y(s))\,ds.

Adams-Bashforth (explicit)

Interpolate f(s,y(s)) at the m nodes t_k,t_{k-1},\ldots,t_{k-m+1} by the polynomial p_{m-1} of degree m-1 and integrate the interpolant:
yk+1=yk+tktk+1pm1(s)ds=yk+hl=0m1blf(tkl,ykl),y_{k+1}=y_k+\int_{t_k}^{t_{k+1}} p_{m-1}(s)\,ds=y_k+h\sum_{l=0}^{m-1}b_l f(t_{k-l},y_{k-l}),
where
bl=1htktk+1(j=0jlm1stkjtkltkj)ds.b_l=\frac{1}{h}\int_{t_k}^{t_{k+1}}\left(\prod_{\substack{j=0\\ j\not=l}}^{m-1}\frac{s-t_{k-j}}{t_{k-l}-t_{k-j}}\right)\,ds.
The truncation error is O(hm)O(h^m).

What methods can be recovered?
For m=1m=1, l=0l=0, we get b0=1b_0=1 and
yk+1=yk+hf(tk,yk),y_{k+1}=y_k+hf(t_k,y_k),
and thus Euler’s method!
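
For m=2, the coefficients are b_0=\frac{3}{2} and b_1=-\frac{1}{2}, giving the two-step Adams-Bashforth method y_{k+1}=y_k+h\left[\frac{3}{2}f(t_k,y_k)-\frac{1}{2}f(t_{k-1},y_{k-1})\right]. A sketch on an illustrative test problem (the missing starting value y_1 is bootstrapped with one Euler step):

```python
from math import exp

# Two-step Adams-Bashforth for y' = -y, y(0) = 1 on [0, 1].
f = lambda t, y: -y
h = 0.1
y_prev = 1.0                       # y_0
y = y_prev + h * f(0.0, y_prev)    # y_1 from one Euler step
for k in range(1, 10):             # steps k = 1, ..., 9 produce y_2, ..., y_10
    y, y_prev = y + h * (1.5 * f(k * h, y) - 0.5 * f((k - 1) * h, y_prev)), y
print(abs(y - exp(-1.0)))          # error at t = 1
```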

Adams-Moulton (implicit)

Add t_{k+1} as an interpolation node and interpolate by the polynomial q_m of degree m:
yk+1=yk+tktk+1qm(s)ds=yk+hl=0mclf(tk+1l,yk+1l),y_{k+1}=y_k +\int_{t_k}^{t_{k+1}} q_{m}(s)\,ds=y_k+h\sum_{l=0}^{m}c_l f(t_{k+1-l},y_{k+1-l}),
where
cl=1htktk+1(j=0jlmstk+1jtk+1ltk+1j)ds.c_l=\frac{1}{h}\int_{t_k}^{t_{k+1}}\left(\prod_{\substack{j=0\\ j\not=l}}^{m}\frac{s-t_{k+1-j}}{t_{k+1-l}-t_{k+1-j}}\right)\,ds.
The truncation error is O(hm+1)O(h^{m+1}).

For m=0, l=0, we get c_0=1 and
y_{k+1}=y_k+hf(t_{k+1},y_{k+1}),
which is the so-called backward Euler method (also called implicit Euler method)! For m=1, the coefficients are c_0=c_1=\frac{1}{2}, and we recover the implicit trapezoidal method from above.

Why do we need multistep methods?

Bad Example.
y=15y,y(0)=1.y'=-15 y,\quad y(0)=1.
Exact solution y(t)=e^{-15t}. For h=\frac{1}{4}, the Euler iterates are y_k=(1-15h)^k=(-2.75)^k, which oscillate about zero with exponentially growing magnitude.
Adams-Moulton (trapezoidal method) works!
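
A sketch of the comparison (for this linear problem each trapezoidal step can be written in closed form, so no nonlinear solve is needed):

```python
# y' = -15 y, y(0) = 1, step size h = 1/4.
lam, h, n = -15.0, 0.25, 8

y_euler, y_trap = 1.0, 1.0
for _ in range(n):
    y_euler *= 1.0 + h * lam                              # Euler factor: 1 - 3.75 = -2.75
    y_trap  *= (1.0 + h * lam / 2) / (1.0 - h * lam / 2)  # trapezoidal factor: about -0.30
    print(f"{y_euler:12.4e}   {y_trap:12.4e}")
# The Euler iterates oscillate in sign and blow up; the trapezoidal iterates stay
# bounded and decay to zero, as the exact solution e^{-15 t} does.
```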

General form.

The general form of a linear multistep method is given for sNs\in \mathbb{N} by
\sum_{j=0}^s a_j y_{n+j}=h\sum_{j=0}^s b_j f(t_{n+j},y_{n+j}),
where as=1a_s=1 (normalization) and the coefficients a0,,as1a_0,\ldots, a_{s-1} and b0,,bsb_0,\ldots,b_s determine the method.

The method is called explicit if bs=0b_s=0, and implicit otherwise.

We call the multistep method consistent if the truncation error is O(h)O(h) or better.

Theorem. The linear multistep method is consistent if and only if
k=0s1ak=1\sum_{k=0}^{s-1}a_k=-1
and
k=0sbk=s+k=0s1kak.\sum_{k=0}^s b_k=s+\sum_{k=0}^{s-1} k a_k.
If, moreover,
qk=0skq1bk=sq+k=0s1kqak,q\sum_{k=0}^s k^{q-1} b_k=s^q+\sum_{k=0}^{s-1} k^q a_k,
for every q=1,,pq=1,\ldots,p then the truncation error is O(h1+p)O(h^{1+p}).
(Here, we follow the non-standard convention that 0^0=0.)

See [Ernst Hairer, Gerhard Wanner, Syvert P. Nørsett. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, 2nd ed., 1993] for a proof.

The stability of a multistep method also depends on the starting values y_1,\ldots,y_{s-1}, which must converge to y_0 as h\to 0. For a stable method, the additional condition yields a global error of O(h^p).

Example. s=1. The first condition gives a_0=-1 and, by the normalization assumption, a_1=1. The second condition gives b_0+b_1=1, and the condition for q=1 gives b_1=1, hence b_0=0. Thus we get,
yk+1=yk+hf(tk+1,yk+1)y_{k+1}=y_k+h f(t_{k+1},y_{k+1})
backward Euler!

Example. (Good bad example)

yk+23yk+1+2yk=h[1312f(tk+2,yk+2)53f(tk+1,yk+1)512f(tk,yk)].y_{k+2}-3y_{k+1}+2y_{k}=h\left[\frac{13}{12}f(t_{k+2},y_{k+2})-\frac{5}{3}f(t_{k+1},y_{k+1})-\frac{5}{12}f(t_k,y_k)\right].
This method satisfies both conditions of the theorem: the first one since 2-3=-1, and the second one since \frac{13}{12}-\frac{5}{3}-\frac{5}{12}=-1=2+1\cdot(-3). Hence it is consistent. Nevertheless, the method is not stable: small perturbations of the initial values are amplified exponentially. Indeed,
for
y=0,y(0)=1,y'=0,\quad y(0)=1,
which has the explicit solution y(t)=1y(t)=1, we consider a small perturbation of the initial value, δ>0\delta>0,
so that
y0=1,y1=1+δ,y_0=1,\quad y_1=1+\delta,
y2=3y12y0=1+3δ,y_2=3y_1-2y_0=1+3\delta,
\cdots
yk=3yk12yk2=1+(2k1)δ.y_k=3y_{k-1}-2y_{k-2}=1+(2^k-1)\delta.
Hence, for δ253\delta\sim 2^{-53}, and k=100k=100, we get the error 247\sim 2^{47}.
Note that the exact differential equation is stable (in the mathematical sense): the perturbed solution y_\delta(t)=1+\delta converges uniformly to the exact solution y(t)=1 as \delta\to 0. The instability is therefore a defect of the numerical method, not of the problem.

Example. (Effect of rounding error)

Returning to the proof of convergence for Euler, for the rounding error δ>0\delta>0,
dk+1(1+hL)dk+δ,|d_{k+1}|\le (1+hL)|d_k|+\delta,
we get,
dk+1eL(Tt0)d0undefinedinitial error or uncertainty+eL(Tt0)1Lhδundefineddominant term, if h is sufficiently small|d_{k+1}|\le e^{L(T-t_0)}\underbrace{|d_0|}_{\text{initial error or uncertainty}}+\underbrace{\frac{e^{L(T-t_0)}-1}{Lh}\delta}_{\text{dominant term, if $h$ is sufficiently small}}
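
For instance, keeping both the truncation and the rounding contributions in the recursion (and assuming d_0=0), the lemma above gives the bound
|d_k|\le \frac{e^{L(T-t_0)}-1}{L}\left(\frac{h}{2}M+\frac{\delta}{h}\right),
whose right-hand side is minimized at h=\sqrt{2\delta/M}. Decreasing the step size below this value makes the bound worse, not better: rounding errors impose a practical lower limit on useful step sizes.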

Gradient descent

The following algorithm is widely used in machine learning, together with its probabilistic counterpart, stochastic gradient descent (SGD).

The goal is to find the minima of a function
f:DR,DRd,f:D\to\mathbb{R},\quad D\subset\mathbb{R}^d,
which is assumed suitably regular, e.g. fC1(DD)f\in C^1(D\setminus\partial D).

Gradient descent algorithm.

For simplicity, assume that 0D0\in D.

  1. Fix the initial point w_0:=0\in D.
  2. For k=0,1,\ldots,N, where N\in\mathbb{N}, obtain w_{k+1} by moving away from w_k in the direction opposite to the gradient of f at w_k, with step size \eta_{k+1}>0; more precisely (see the sketch after this list),
    w_{k+1}:=w_k-\eta_{k+1}\nabla f(w_{k}),\quad k=0,1,\ldots, N.
    The constants \eta_k are called learning rates.
  3. After the Nth step, we may choose different outputs, e.g. simply \bar{w}_N:=w_N or
    \bar{w}_N:=\operatorname{arg\,min}_{k=0,\ldots,N}f(w_k).
    Less obviously, one may also choose the average
    \bar{w}_N:=\frac{1}{N+1}\sum_{k=0}^N w_k,
    which is particularly useful for SGD.
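
A minimal Python sketch of the iteration with a constant learning rate (the quadratic objective and all names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, eta, n_steps, w0):
    """Plain gradient descent w_{k+1} = w_k - eta * grad_f(w_k); returns all iterates."""
    ws = [np.asarray(w0, dtype=float)]
    for _ in range(n_steps):
        ws.append(ws[-1] - eta * grad_f(ws[-1]))
    return np.array(ws)

# Illustrative objective f(w) = |w|^2 / 2 with gradient w; minimizer at the origin.
ws = gradient_descent(lambda w: w, eta=0.1, n_steps=50, w0=[2.0, -1.0])
print(ws[-1])              # close to [0, 0]
print(ws.mean(axis=0))     # averaged iterate, the third output choice above
```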

Definition. f:RdRf:\mathbb{R}^d\to\mathbb{R} is called convex, if for every λ[0,1]\lambda\in [0,1], x,yRdx,y\in \mathbb{R}^d,
f(\lambda x+(1-\lambda)y)\le\lambda f(x)+(1-\lambda)f(y).

Note that if ff is convex,
f(i=0Nλixi)i=0Nλif(xi),f\left(\sum_{i=0}^N \lambda_i x_i\right)\le \sum_{i=0}^N \lambda_i f(x_i),
for every x0,x1,,xNRdx_0,x_1,\ldots,x_N\in \mathbb{R}^d, whenever λi[0,1]\lambda_i\in [0,1] satisfy i=0Nλi=1\sum_{i=0}^N\lambda_i=1.

Continuously differentiable convex functions f:RdRf:\mathbb{R}^d\to\mathbb{R} enjoy the so-called subgradient property, i.e.
f(x)f(y)f(x),xy,x,yRd,f(x)-f(y)\le \langle \nabla f(x),x-y\rangle,\quad x,y\in\mathbb{R}^d,
where ,\langle\cdot,\cdot\rangle denotes the Euclidean scalar product.

Theorem. Let f:RdRf:\mathbb{R}^d\to\mathbb{R} be convex, continuously differentiable and LL-Lipschitz continuous, i.e.,
f(x)f(y)Lxy,x,yRd.|f(x)-f(y)|\le L|x-y|,\quad x,y\in \mathbb{R}^d.
Let R>0R>0, NNN\in\mathbb{N}. Set
m:=minwRf(w),ηk:=η:=RLN+1.m:=\min_{|w|\le R}f(w),\quad \eta_k:=\eta:=\frac{R}{L\sqrt{N+1}}.
Then for
wˉN:=1N+1k=0Nwk,\bar{w}_N:=\frac{1}{N+1}\sum_{k=0}^N w_k,
we have that
f(wˉN)mRLN+1.f(\bar{w}_N)-m\le\frac{RL}{\sqrt{N+1}}.

Note. The point wˉN\bar{w}_N is doubly dependent on NN; not only through the number of steps, but also through the choice of η\eta.

Remark.

  1. Assume that ff has a global minimum in wRdw_\ast\in\mathbb{R}^d. Then, the above result ensures the convergence of f(wˉN)f(\bar{w}_N) to the minimum f(w)f(w_\ast), provided that RwR\ge |w_\ast|. Indeed, the claimed estimate, together with f(wˉN)f(w)0f(\bar{w}_N)-f(w_\ast)\ge 0 yields f(wˉN)f(w)RLN+1.|f(\bar{w}_N)-f(w_\ast)|\le \frac{RL}{\sqrt{N+1}}.
  2. It is not guaranteed that \{\bar{w}_N\}_{N\in\mathbb{N}} converges to w_\ast unless w_\ast is the unique minimizer (e.g. if f is so-called strictly convex, i.e., the inequality in the definition of convexity is strict for \lambda\in (0,1) and x\not=y).
  3. The convergence rate is sublinear unless ff is so-called strongly convex (i.e., fδ2f-\delta|\cdot|^2 is convex for some δ>0\delta>0), which gives a linear convergence rate.
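
The guaranteed rate can be checked numerically, e.g. for the illustrative objective f(w)=\log\cosh(w-0.8), which is convex, continuously differentiable and 1-Lipschitz (L=1), with minimum m=0 attained at w_\ast=0.8 (so R=1\ge|w_\ast|); a sketch:

```python
import numpy as np

a = 0.8
f = lambda w: np.log(np.cosh(w - a))     # convex, C^1, 1-Lipschitz, minimum 0 at w = a
grad_f = lambda w: np.tanh(w - a)

R, L = 1.0, 1.0
for N in [10, 100, 1000, 10000]:
    eta = R / (L * np.sqrt(N + 1))       # learning rate from the theorem
    w, iterates = 0.0, []
    for _ in range(N + 1):               # k = 0, ..., N
        iterates.append(w)
        w -= eta * grad_f(w)
    w_bar = np.mean(iterates)            # averaged output \bar{w}_N
    print(N, f(w_bar), R * L / np.sqrt(N + 1))   # observed gap vs. bound RL/sqrt(N+1)
```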

We start by proving an auxiliary result.

Lemma. Let v_1,v_2,\ldots,v_{N+1},w_\ast be vectors in \mathbb{R}^d, and let \eta>0. Setting
w_0=0 and
w_k:=w_{k-1}-\eta v_k,\quad k=1,\ldots,N+1,
we get that
k=0Nvk+1,wkww22η+η2k=0Nvk+12.\sum_{k=0}^N \langle v_{k+1},w_k-w_\ast\rangle \le \frac{|w_\ast|^2}{2\eta}+\frac{\eta}{2}\sum_{k=0}^N |v_{k+1}|^2.
In particular, we have that
1N+1k=0Nvk+1,wkwRLN+1,\frac{1}{N+1}\sum_{k=0}^N \langle v_{k+1},w_k-w_\ast\rangle\le \frac{RL}{\sqrt{N+1}},
for any R,L>0R,L>0 such that
η=RLN+1,\eta=\frac{R}{L\sqrt{N+1}},
and
wR,vkL,k=1,,N+1.|w_\ast|\le R,\quad |v_k|\le L,\quad k=1,\ldots,N+1.

Proof. A direct computation shows (polarization identity)
vk+1,wkw=12η(wkw2+η2vk+12wkwηvk+12)=12η(wkw2wk+1w2)+η2vk+12.\langle v_{k+1},w_k-w_\ast\rangle =\frac{1}{2\eta}\left(|w_k-w_\ast|^2+\eta^2|v_{k+1}|^2-|w_k-w_\ast-\eta v_{k+1}|^2\right)=\frac{1}{2\eta}\left(|w_k-w_\ast|^2-|w_{k+1}-w_\ast|^2\right)+\frac{\eta}{2}|v_{k+1}|^2.
Adding up with respect to kk yields
k=0Nvk+1,wkw=12ηk=0N(wkw2wk+1w2)+η2k=0Nvk+12.\sum_{k=0}^N\langle v_{k+1},w_k-w_\ast\rangle =\frac{1}{2\eta}\sum_{k=0}^N\left(|w_k-w_\ast|^2-|w_{k+1}-w_\ast|^2\right)+\frac{\eta}{2}\sum_{k=0}^N|v_{k+1}|^2.
The first term is a telescoping sum and w0=0w_0=0, so that we get
\sum_{k=0}^N\langle v_{k+1},w_k-w_\ast\rangle =\frac{1}{2\eta}\left(|w_\ast|^2-|w_{N+1}-w_\ast|^2\right)+\frac{\eta}{2}\sum_{k=0}^N|v_{k+1}|^2,
which proves the first assertion, since the term -|w_{N+1}-w_\ast|^2 is nonpositive.
For the second assertion, it is enough to observe that under our conditions
\frac{|w_\ast|^2}{2\eta}+\frac{\eta}{2}\sum_{k=0}^N |v_{k+1}|^2\le \frac{R^2}{2\eta}+\frac{\eta(N+1)L^2}{2}=RL\sqrt{N+1},
and to divide by N+1.
\Box

Proof of the Theorem. Recalling that ff is convex, we get that
f(wˉN)=f(1N+1k=0Nwk)1N+1k=0Nf(wk).f(\bar{w}_N)=f\left(\frac{1}{N+1}\sum_{k=0}^N w_k\right)\le\frac{1}{N+1}\sum_{k=0}^N f(w_k).
Therefore, for any w_\ast\in\operatorname{arg\,min}_{|w|\le R}f(w), the subgradient property and the Lemma, applied with v_{k+1}:=\nabla f(w_k) (note that |\nabla f(w_k)|\le L by the Lipschitz continuity of f), yield
f(\bar{w}_N)-m=f(\bar{w}_N)-f(w_\ast)\le\frac{1}{N+1}\sum_{k=0}^N(f(w_k)-f(w_\ast))\le\frac{1}{N+1}\sum_{k=0}^N\langle\nabla f(w_k),w_k-w_\ast\rangle\le \frac{RL}{\sqrt{N+1}}.
\Box

Literature

  1. Anne Greenbaum and Tim P. Chartier. Numerical Methods: Design, Analysis, and Computer Implementation of Algorithms, Princeton University Press, 2012.
  2. L. Ridgway Scott. Numerical Analysis, Princeton University Press, 2011.
  3. Qingkai Kong, Timmy Siauw, and Alexandre Bayen. Python Programming and Numerical Methods. A Guide for Engineers and Scientists. Academic Press, 2020.
  4. Tobin A. Driscoll and Richard J. Braun, Fundamentals of Numerical Computation, SIAM, 2017.
  5. Ernst Hairer, Gerhard Wanner, Syvert P. Nørsett. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, 2nd ed., 1993.
  6. Real Analysis, Wikibooks, Creative Commons BY-SA 4.0.
  7. Stefano Pagliarani. An introduction to discrete-time stochastic processes and their applications. Lecture notes, Alma Mater Studiorum - Università di Bologna, 2024.
  8. Harri Hakula. Numerical Analysis. Lecture transcript, Aalto University, 2021.

  1. jonas.tolle@aalto.fi