A Succinct Proof of the Decoupled Parallel Backpropagation Convergence Lemma

The notation below follows the original paper. $$ \begin{align*} \mathbb{E} [f (w^{t + 1})] - f (w^t) & \leqslant \nabla f (w^t)^{\top} \mathbb{E} [(w^{t + 1} - w^t)] + \frac{L}{2} \mathbb{E} [\| w^{t + 1} - w^t \|^2]\\ & = - \gamma_t \nabla f (w^t)^{\top} \mathbb{E} \left[ \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right] + \frac{L \gamma_t^2}{2} \mathbb{E} \left[ \left\| \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right\|^2 \right]\\ & = - \gamma_t \| \nabla f (w^t) \|^2 - \gamma_t \nabla f (w^t)^{\top} \left( \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right) + \frac{K L \gamma_t^2}{2} \sum_{k = 1}^K \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \|^2]\\ & \leqslant - \gamma_t \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \left\| \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right\|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f_{\mathcal{G} (k)} (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f (w^{t - K + k}) - \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K L^2 \gamma_t}{2} \sum_{k = 1}^K \| w^{t - K + k} - w^t \|^2\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K^4 L^2 M^2 \sigma \gamma_t^2}{2}\\ & = - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \gamma_t^2 \frac{K^2 L M}{2} (1 + K^2 L M \sigma) \end{align*} $$
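For reference while checking the chain above, the steps rely on the paper's assumptions, restated here as a sketch (the precise definition of $\sigma$, which ties the step sizes inside the delay window to $\gamma_t$, follows the original paper): $L$-Lipschitz gradients and bounded stochastic gradients, $$\| \nabla f (u) - \nabla f (v) \| \leqslant L \| u - v \|, \qquad \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i} (w) \|^2] \leqslant M,$$ together with the elementary inequalities $a^{\top} b \leqslant \frac{1}{2} \| a \|^2 + \frac{1}{2} \| b \|^2$ and $\left\| \sum_{k = 1}^K a_k \right\|^2 \leqslant K \sum_{k = 1}^K \| a_k \|^2$.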

April 21, 2024 · 2 min

Intuition of Universal Approximation Theorem

The universal approximation theorem states that a single-hidden-layer neural network with a squashing activation function can, at sufficient (finite but arbitrarily large) width, uniformly approximate an arbitrary continuous function. The theorem also has stronger forms covering approximation of Borel measurable functions, but continuous functions are enough in our case. Intuitively, we may expect the space of all continuous functions to approximate the space of Borel measurable functions in an almost-everywhere, probabilistic sense, but we will not delve into the details here. For more information, see the original paper (Multilayer Feedforward Networks are Universal Approximators). ...
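To make the theorem concrete, here is a minimal illustrative sketch (my own, not from the post): a single-hidden-layer network with random sigmoid units, where only the output weights are fit by least squares. The target function, width, and weight scales are all illustrative assumptions; widening the hidden layer should drive the error down, as the theorem predicts.

```python
# A random-feature instance of a single-hidden-layer sigmoid network.
# Only the output layer is trained (by least squares); increasing
# `width` improves the fit on the grid, illustrating the theorem.
import numpy as np

rng = np.random.default_rng(0)
width = 200                                   # hidden units; try 20 vs 200
x = np.linspace(0.0, 1.0, 500)[:, None]       # inputs on [0, 1], shape (500, 1)
target = np.sin(4 * np.pi * x).ravel()        # a continuous target function

W = rng.normal(scale=20.0, size=(1, width))   # random hidden weights
b = rng.uniform(-10.0, 10.0, size=width)      # random hidden biases
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))        # sigmoid (squashing) features

coef, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output layer only
print("max |error| on grid:", np.abs(H @ coef - target).max())
```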

April 4, 2024 · 8 min

Optimal Code Existence for Countably Infinite Sources

Huffman coding demonstrates both the existence of an optimal code and a concrete construction for sources with a finite alphabet. However, the construction breaks down for infinite alphabets. We will prove the existence of an optimal code for sources with a countably infinite alphabet. Notations We use only 0 and 1 to construct codewords, without loss of generality, and the base of $\log$ is 2 by default. $X$ is the random variable from the source, whose probability distribution is $p_1, p_2, p_3, \cdots$, and $l_1, l_2, l_3, \cdots$ denote the lengths of the corresponding codewords. Without loss of generality, we assume $p_1 \geq p_2 \geq p_3 \geq \cdots$. By the rearrangement inequality, we may set $l_1 \leq l_2 \leq l_3 \leq \cdots$. The problem we are considering here can be modeled as an integer optimization problem: ...
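The truncated formulation is presumably the standard Kraft-constrained program (a sketch under that assumption): minimize the expected codeword length subject to the Kraft inequality, $$\min_{l_1, l_2, l_3, \cdots} \sum_{i=1}^{\infty} p_i l_i \quad \text{subject to} \quad \sum_{i=1}^{\infty} 2^{-l_i} \leq 1, \quad l_i \in \mathbb{Z}^{+},$$ since the Kraft inequality characterizes exactly the length profiles achievable by binary prefix codes.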

March 30, 2024 · 3 min

A Problem on `va_list` in C Language

What’s the output of the following code, and why?

```c
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("%#018llx\n", (char)0x80);
    printf("%#018llx\n", (unsigned char)0x80);
    return 0;
}
```

(You might encounter warnings informing you of the inconsistency between the specified format and the given arguments. Let's neglect them.) The answer is

```
0x00000000ffffff80
0x0000000000000080
```

Questions We have two questions: Is it overloading that contributes to the different behaviors when different types of arguments are passed? Why is the first output 0x00000000ffffff80 instead of 0xffffffffffffff80? Answers The analysis is based on the printf implementation in x86 linux boot ...

November 11, 2023 · 5 min

My Paper-Reading Workflow in 2023

Main The conventional approach to storing a file involves fitting it into a hierarchical structure, which necessitates a comprehensive overview of the corresponding field before the very first paper is read. You may instead place papers flat in an inbox folder, postponing the tedious task of reindexing and categorizing hundreds of them hierarchically; otherwise, the overwhelming folder becomes your first obstacle to retrieving information. Either way, both methods impose an additional mental burden. Furthermore, the fact that papers can be organized in multiple ways compounds the disaster in a hierarchical system, where each paper sits in exactly one place. Therefore, an ideal note-taking structure for papers should be bottom-up: papers are connected like a web, and a single paper can be attached to several distinct categories. ...

September 1, 2023 · 3 min

Strictness of Markov Properties

A stochastic process $\{X_i\}_{i=0}^\infty$ is $n$-Markov if $$P(X_{t+n}|X_{t+n-1}, X_{t+n-2}, \cdots, X_{t}) = P(X_{t+n}|X_{t+n-1})$$ for any $t \ge 0$. We will prove that an $n$-Markov stochastic process must be $m$-Markov but is not necessarily $l$-Markov, where $l > n > m$. N+1 to N First, we prove that an $(n+1)$-Markov stochastic process must be $n$-Markov. Proof: Suppose $\{X_i\}_{i=0}^\infty$ is an $(n+1)$-Markov stochastic process. We have $$P(X_{t+n+1}|X_{t+n}, X_{t+n-1}, \cdots, X_t) = P(X_{t+n+1} | X_{t+n})$$ for any $t \ge 0$, deriving ...
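The truncated derivation presumably proceeds by marginalizing out one more past state; a sketch of that step for $t \ge 1$ (my reconstruction, with the boundary case handled separately): $$P(X_{t+n}|X_{t+n-1}, \cdots, X_t) = \sum_{x} P(X_{t+n}|X_{t+n-1}, \cdots, X_t, X_{t-1}=x)\, P(X_{t-1}=x|X_{t+n-1}, \cdots, X_t) = P(X_{t+n}|X_{t+n-1}),$$ where the last equality applies the $(n+1)$-Markov property starting at time $t-1$.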

August 28, 2023 · 2 min

Some QR Codes Generated by ControlNet

I have tried HARD to generate a photo of a brain or a neuron that can be scanned. But unfortunately, either it cannot be recognized as a QR code or it is irrelevant to my prompts.😭 It is much easier to generate images of girls or natural scenery, though. Perhaps it would be better to choose another Stable Diffusion checkpoint. (By the way, thanks to Stable Diffusion WebUI, it is quite easy to deploy famous diffusion models.) ...

August 4, 2023 · 1 min

Build Singularity/Docker Image on a Singularity Server without `sudo` Privileges

In a Docker container, you have full privileges and can build a Singularity or Docker image inside it. But if only Singularity is installed on the server, the root user has set up neither `--fakeroot` nor `proot`, and you have exhausted your remote build minutes, what trick can you play to work around those restrictions? Software Selection To solve the problem, we need a virtual machine under our control on the server, granting enough privileges to execute `singularity build` (or `docker build`), which requires `sudo` under the tough conditions described above. Then an OS runs inside the virtual machine. Finally, we can build the image. ...

July 31, 2023 · 4 min

Brainstorm

Here are some immature ideas that I come up with while taking a walk or having a meal. Feel free to draw inspiration from this post and write some papers; I would be glad to see studies on these topics. This post will be updated from time to time. The research field underestimates the importance of optimization. You cannot expect one to have perfect memory; what we do every day is just adjust our cognition. That's what Mass Editing Memory in a Transformer does! This method remodels knowledge, namely the relations between objects, but it does not involve behavioral remodeling, for example the procedure for solving a tedious math problem. Information about the relations between texts and images lives in cross-attention layers (e.g. Prompt-to-Prompt), while knowledge-related relations are stored in MLP layers (e.g. https://rome.baulab.info/). Are procedural relations stored in cross-attention layers? ...

March 16, 2023 · 5 min

Existence of Optimal Policy in Markov Decision Process

In this blog, we will prove the following theorem: Optimal Policy Existence Theorem: For any Markov decision process, (1) there exists an optimal policy $\pi_\star$ that is better than or equal to all other policies, $\pi_\star \geq \pi, \forall \pi$; (2) all optimal policies achieve the optimal value function, $V_{\pi_\star}(s) = V_\star(s)$; (3) all optimal policies achieve the optimal action-value function, $Q_{\pi_\star}(s, a) = Q_\star(s, a)$. Definition To simplify the exposition, we first define some basic concepts. ...
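The ordering $\pi_\star \geq \pi$ in the statement is presumably the standard pointwise order on value functions; as a sketch of the definitions assumed here: $$\pi \geq \pi' \iff V_{\pi}(s) \geq V_{\pi'}(s) \quad \forall s, \qquad V_\star(s) = \max_{\pi} V_{\pi}(s), \qquad Q_\star(s, a) = \max_{\pi} Q_{\pi}(s, a).$$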

September 12, 2022 · 8 min