Theory

Notes on Sauer's Lemma

Introduction Every binary classifier is a function mapping its input, which is an element in an enumerable dataset, to 0 or 1. Equivalently, we could regard the classifier as a function $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $. We have a set of hypotheses $\mathcal{H}$ from which a function is chosen to maximize the classification accuracy. It is perfect if $\mathcal{H}$ contains all possible functions $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $, which indicates a universal approximator. However, when the expressiveness of our model is limited by computational cost or the size of the magnitude of parameters, it remains a problem to quantitatively measure the ability to approximate the ground truth. For example, if $\mathcal{H}$ only consists of functions which produce 1 only on one data point, namely, ...

A Succinct Proof of Decoupled Parallel Backpropagation Convergence Lemma

The meaning of the notations below complies with the original paper. $$ \begin{align*} \mathbb{E} [f (w^{t + 1})] - f (w^t) & \leqslant \nabla f (w^t)^{\top} \mathbb{E} [(w^{t + 1} - w^t)] + \frac{L}{2} \mathbb{E} [\| w^{t + 1} - w^t \|^2]\\ & = - \gamma_t \nabla f (w^t)^{\top} \mathbb{E} \left[ \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right] + \frac{L \gamma_t^2}{2} \mathbb{E} \left[ \left\| \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right\|^2 \right]\\ & = - \gamma_t \| \nabla f (w^t) \|^2 - \gamma_t \nabla f (w^t)^{\top} \left( \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right) + \frac{K L \gamma_t^2}{2} \sum_{k = 1}^K \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \|^2]\\ & \leqslant - \gamma_t \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \left\| \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right\|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f_{\mathcal{G} (k)} (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f (w^{t - K + k}) - \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K L^2 \gamma_t}{2} \sum_{k = 1}^K \| w^{t - K + k} - w^t \|^2\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K^4 L^2 M^2 \sigma \gamma_t^2}{2}\\ & = - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \gamma_t^2 \frac{K^2 L M}{2} (1 + K^2 L M \sigma) \end{align*} $$

Intuition of Universal Approximation Theorem

Universal approximation theorem states that an infinite width single layer neural network is able to approximate an arbitrary continuous function uniformly with a squashing function. It also have some stronger statement for the approximation to Borel measurable function. But continuous function is enough in our case. And we may intuitively expect that the space of all continuous functions could approximate the space of Borel measurable functions almost surely in the sense of probability. But we would not delve into the details of it. For more information, check the original paper(Multilayer Feedforward Networks are Universal Approximators). ...

Strictness of Markov Properties

A stochastic process $\{X_i\}_{i=0}^\infty $ is $n$-Markov if $$P(X_{t+n}|X_{t+n-1}, X_{t+n-2}, \cdots , X_{t}) = P(X_{t+n}|X_{t+n-1})$$for any $t \ge 0$. We would prove that An $n$-Markov stochastic process must be $m$-Markov while is not necessarily $l$-Markov where $l > n > m$ N+1 to N First, we prove an (n+1)-Markov stochastic process must be n-Markov. Proof: Suppose $\{X_i\}_{i=0}^\infty$ is an $(n+1)$-Markov stochastic process. We have $$P(X_{t+n}|X_{t+n-1}, X_{t+n-2}, \cdots, X_t) = P(X_{t+n} | X_{t+n-1})$$ for any $t \ge 0$, deriving ...

Existence of Optimal Policy in Markov Decision Process

In this blog, we will prove the following theorem: Optimal Policy Existence Theorem: For any Markov Decision Process, There exists an optimal policy $\pi_\star $ that is better than or equal to all other policies, $ \pi_\star \geq \pi , \forall \pi$ All optimal policies achieve the optimal value function, $V_{\pi_\star}(s)=V_\star(s)$ All optimal policies achieve the optimal action-value function $Q_{\pi_\star}(s,a)=Q_\star(s,a)$ Definition To simplify the exposition, we first define some basic concepts. ...