Contents

Theory about applications - [TA4]

Question
Illustrate in particular, Knuth recursion for the computation of the arithmetic mean or average, discussion why it is preferable to the “naive” algo

Introduction

The Knuth online algorithm described in the lesson allows us to calculate the arithmetic mean without needing to keep track of all the values collected. This is particularly useful for the reasons we will explain after a brief demonstration on how to obtain the formula.

Proof

First of all let’s define some notation. We will denote with \(x_{i}\) the i-th element of the sum and with $\bar x_{i}$ the arithmetic mean at step i.

Here is the naive formula to calculate the arithmetic mean: $$ \bar x_{n}={\frac {1}{n}}\sum_ {i=1}^{n}x_{i}={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}} $$

Given this general formula we can rewrite it putting emphasis on the last and penultimate elements: $$ \bar x_{n}={\frac {x_{1}+\cdots+x_{n-1}+x_{n}}{n}} \\ $$ $$ \bar x_{n}={\frac {x_{1}+\cdots+x_{n-1}}{n}}+{\frac {x_{n}}{n}} $$ Now we can multiply the first member of the addition by $ (\frac {n-1}{n-1}) $ $$ \bar x_{n}=({\frac {n-1}{n-1}})\cdot {\frac {x_{1}+\cdots+x_{n-1}}{n}}+{\frac {x_{n}}{n}} $$ $$ \bar x_{n}=({\frac {n-1}{n}})\cdot \fcolorbox{red}{#292a2d}{$\frac {x_{1}+\cdots+x_{n-1}}{n-1}$}+{\frac {x_{n}}{n}} $$ Note how the highlighted member is simply $ \bar x_{n-1} $, we can therefore rewrite the formula as $$ \bar x_{n}=({\frac {n-1}{n}})\cdot {\bar x_{n-1}}+{\frac {x_{n}}{n}} $$ $$ \bar x_{n}={\frac {(\bar x_{n-1}\cdot n)-\bar x_{n-1}}{n}}+{\frac {x_{n}}{n}} $$ $$ \bar x_{n}={\frac {\bar x_{n-1}\cdot \cancel n}{\cancel n}}-{\frac {\bar x_{n-1}}{n}}+{\frac {x_{n}}{n}} $$ From which we can obtain the final formula $$ \colorbox{#208020}{$ \Large{\bar x_{n}=\bar x_{n-1}+{\frac {x_{n}-\bar x_{n-1}}{n}} } $} $$ We recognise this as the same formula seen during the lesson.

An alternative solution

While trying to obtain the online formula for the arithmetic mean I actually used a slightly different approach, which gave me as a result a different formula which I consider to be correct. I think the formula I obtained can be rewritten as the one shown above and during the lesson, but I think it is still interesting to follow my initial reasoning to get it, as it demonstrated to myself once again that reasoning is more important than memorizing.
The approach I used is actually very similar to the “original” one.

Let’s start once again from the general formula: $$ \bar x_{n}={\frac {x_{1}+\cdots+x_{n}}{n}} $$ This can be used to calculate $ \bar x_{n}$, let’s now use it to calculate $ \bar x_{n+1}$ $$ \bar x_{n+1}={\frac {x_{1}+\cdots+x_{n}+x_{n+1}}{n+1}} $$ Let’s once again separate the fraction $$ \bar x_{n+1}={\frac {x_{1}+\cdots+x_{n}}{n+1}}+{\frac {x_{n+1}}{n+1}} $$ Multiply $$ \bar x_{n+1}=({\frac {n}{n}})\cdot{\frac {x_{1}+\cdots+x_{n}}{n+1}}+{\frac {x_{n+1}}{n+1}} $$ $$ \bar x_{n+1}=({\frac {n}{n+1}})\cdot \fcolorbox{red}{#292a2d}{$\frac {x_{1}+\cdots+x_{n}}{n}$} + {\frac {x_{n+1}}{n+1}} $$ $$ \bar x_{n+1}=({\frac {n}{n+1}})\cdot{\bar x_{n}}+{\frac {x_{n+1}}{n+1}} $$ With the final step we obtain the following formula: $$ \colorbox{#208020}{$ \Large{\bar x_{n+1} = {\frac{(\bar x_{n}\cdot n) + x_{n+1}}{n+1}} } $} $$

Final considerations

Let’s now discuss why an “online” algorithm is preferable to the “naive” one. My consideration will be focused on two key points:

  • Memory consumption
  • Computational complexity

Memory consumption

The classic naive solution is of course very heavy on memory usage, just imagine a continuos stream of data coming in, to be able to batch process all this data we would need to store every input value, but to do so we would need an infinite amount of memory which is of course impossible to have.
On the other end the online solution, provides us with a stream processing tool which only needs a constant amount of memory, which is of course very easy to provide. In particular to calculate the mean at step n ($\bar x_{n}$) we would need to hold the following values:

  • Previous mean $ \bar x_{n-1} $
  • The number of values processed so far $ n $
  • The new value to process $ x_n $

Computational complexity

The lack of infinite memory is not the only problem we face with batch processing, also the computational complexity of the naive formula is impossible to achieve. Supposing to not have a parallel algorithm to perform the additions needed, we would need to compute a linear number of additions each time the mean is updated corresponding to the number of values to process, this gives us a complexity of $O(n)$ at the n-th step and so on. Even supposing we wouldn’t need to deal with a continuos stream of data but just a fix number n of values, if we needed to compute the mean everytime a new value is provided, the total complexity at the end of the execution would be $O(n^2)$ (using Gauss’s summation). Again in this case we can lower the execution to $O(n)$ since in the online algorithm the time complexity to calculate the mean is constant $O(1)$.

Conclusion

For this reasons it is fair to conclude that in case we don’t need to save the single values, it’s always best to calculate the arithmetic mean using the online algorithm.