Deconstructionist Cube

Pictured above is the finished product; a higher-resolution video is at the very bottom. Origin: DD made a keychain with four “喜” (joy) characters on the front and four “苦” (bitterness) characters on the back. It reminded me of the cover of Gödel, Escher, Bach: An Eternal Golden Braid, an object that reads as three different letters when viewed from three directions (image from Wikipedia). So, can we build an object that shows three different Chinese characters when viewed from three angles? The characters are drawn from the Chinese internet “deconstructionist” character set 孝典批乐急蚌润麻赢. Thinking: A structure that presents the desired pattern through its solid shape is called relief (阳文); one that presents the pattern through cut-outs is called intaglio (阴文). The GEB cover, for example, is relief. The intaglio case has a trivial solution: glue three equally sized square boards together pairwise orthogonally, then cut the patterns directly out of the three faces. As long as every connected region of each of the three characters is simply connected, i.e., contains no hole like the one in “口”, the structure shows the cut-out patterns in all three directions, as below. On the other hand, the “no holes” restriction is precisely intaglio’s biggest drawback, so we discuss the relief case below. In this thread, people explored which triples of shapes admit such a construction. A brief restatement: let the patterns on the three faces be subsets $A_1, A_2, A_3$ of $[0,1]^2$. The largest possible candidate is the set $$ S = \{\,(x,y,z)\in [0,1]^3\mid (y,z)\in A_1, (x,z)\in A_2, (x,y)\in A_3\,\} $$However, the cuts made for $A_2$ and $A_3$ may prevent $A_1$ from displaying completely; for instance, you cannot display “一” on all three faces. So we must further verify that for every $(y,z)\in A_1$ there exists $x\in [0,1]$ such that $(x,y)\in A_3$ and $(x,z)\in A_2$, and symmetrically for the other two faces. For generic patterns this criterion almost certainly fails, but there is a family of special solutions: if there exist $x_0, y_0, z_0 \in [0,1]$ such that for all $x,y,z\in [0,1]$ we have $$(y_0, z), (y,z_0)\in A_1, \ (x_0, z),(x, z_0)\in A_2, \ (x_0,y),(x,y_0)\in A_3,$$ then $S$ is a valid construction. ...
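The candidate set $S$ and the face-visibility criterion can be checked numerically on rasterized patterns. Below is a minimal sketch; the three boolean masks are hypothetical placeholders, not the actual characters:

```python
import numpy as np

# Hypothetical rasterized patterns on the three faces (True = material).
n = 4
A1 = np.ones((n, n), dtype=bool)   # seen along the x axis, indexed (y, z)
A2 = np.ones((n, n), dtype=bool)   # seen along the y axis, indexed (x, z)
A3 = np.eye(n, dtype=bool)         # seen along the z axis, indexed (x, y)

# S = {(x, y, z) : (y, z) in A1, (x, z) in A2, (x, y) in A3}
S = A1[None, :, :] & A2[:, None, :] & A3[:, :, None]

# A face displays correctly iff the projection of S equals its pattern.
ok1 = np.array_equal(S.any(axis=0), A1)  # project out x
ok2 = np.array_equal(S.any(axis=1), A2)  # project out y
ok3 = np.array_equal(S.any(axis=2), A3)  # project out z
print(ok1, ok2, ok3)
```

This particular triple passes because $A_1$ and $A_2$ are full squares, an instance of the special-solution family above.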

December 22, 2024 · 1 min

Working Memory in Neural Networks

Classified by their duration, there are at least three kinds of memory in humans: Working Memory (WM), Short-Term Memory (STM), and Long-Term Memory (LTM). WM can be seen as the internal state of the system, varying across the entire process. STM can be considered the memory for milestones or key objects in a multi-stage task. Finally, LTM can be conceived as the neural network itself: persistent knowledge is embedded in the architecture. ...

December 7, 2024 · 9 min

Comprehend Drop Check with Simple Examples

Concepts in Drop Check What is drop glue? Why do we need #[may_dangle]? In which cases is a PhantomData necessary? If you don’t know the answers to the questions above, this post will show you through concrete examples. This page and this page in the Rustonomicon and this page in the RFC book may be helpful. But if you find them hard to understand, as I did, or too cumbersome and abstract, this post is what you want. Additionally, reading materials from different sources has a synergistic effect, so although I believe this post is enough on its own, you are encouraged to turn to other resources, such as the links above, whenever you get stuck on some part of the post. Leave me a comment when you encounter those moments, and I will try to improve it. ...

October 8, 2024 · 14 min

Reimagining Internal Family Systems Through the Lens of AI

IFS Concepts Internal Family Systems (IFS) is a psychotherapy model proposed by Richard Schwartz in the 1980s. In this model, everybody is a combination of different parts, like the characters in the film Inside Out. The difference is that only emotions are anthropomorphized in the animation, while in IFS a part can also be a thought or mindset. For example, if you have social appearance anxiety, there is always a voice in your mind criticizing your appearance; in IFS, we consider the source of this voice a part of your mind. It is somewhat like dramaturgical theory, with IFS focusing more on internal feelings and dramaturgical theory emphasizing social roles. ...

August 8, 2024 · 8 min

Notes on Sauer's Lemma

Introduction Every binary classifier is a function mapping its input, an element of an enumerable dataset, to 0 or 1. Equivalently, we can regard the classifier as a function $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $. We have a set of hypotheses $\mathcal{H}$ from which a function is chosen to maximize classification accuracy. The situation is ideal if $\mathcal{H}$ contains all possible functions $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $, which would make it a universal approximator. However, when the expressiveness of the model is limited by computational cost or the number of parameters, quantitatively measuring its ability to approximate the ground truth remains a problem. For example, if $\mathcal{H}$ consists only of functions that produce 1 on exactly one data point, namely, ...
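The single-point class from the example can be checked against Sauer's lemma numerically. A minimal sketch, assuming a small hypothetical finite domain; indicators of a single point have VC dimension 1, so the lemma bounds the number of induced labelings on $m$ points by $\binom{m}{0} + \binom{m}{1} = m + 1$:

```python
from itertools import combinations
from math import comb

# Hypothetical finite domain; each hypothesis outputs 1 on exactly one point.
domain = range(10)
hypotheses = [lambda x, p=p: int(x == p) for p in domain]

def num_restrictions(sample):
    """Number of distinct labelings the class induces on the sample."""
    return len({tuple(h(x) for x in sample) for h in hypotheses})

def sauer_bound(m, d):
    """Sauer's lemma bound: sum of C(m, i) for i <= d."""
    return sum(comb(m, i) for i in range(d + 1))

# VC dimension of this class is 1; verify the bound on all samples of size 3.
for sample in combinations(domain, 3):
    assert num_restrictions(sample) <= sauer_bound(3, 1)  # at most 4
```

On any 3-point sample, the class produces the three one-hot labelings plus the all-zero labeling (when the hypothesis's point lies outside the sample), exactly meeting the bound of 4.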

July 26, 2024 · 6 min

A Succinct Proof of Decoupled Parallel Backpropagation Convergence Lemma

The meaning of the notations below complies with the original paper. $$ \begin{align*} \mathbb{E} [f (w^{t + 1})] - f (w^t) & \leqslant \nabla f (w^t)^{\top} \mathbb{E} [(w^{t + 1} - w^t)] + \frac{L}{2} \mathbb{E} [\| w^{t + 1} - w^t \|^2]\\ & = - \gamma_t \nabla f (w^t)^{\top} \mathbb{E} \left[ \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right] + \frac{L \gamma_t^2}{2} \mathbb{E} \left[ \left\| \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right\|^2 \right]\\ & = - \gamma_t \| \nabla f (w^t) \|^2 - \gamma_t \nabla f (w^t)^{\top} \left( \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right) + \frac{K L \gamma_t^2}{2} \sum_{k = 1}^K \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \|^2]\\ & \leqslant - \gamma_t \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \left\| \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right\|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f_{\mathcal{G} (k)} (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f (w^{t - K + k}) - \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K L^2 \gamma_t}{2} \sum_{k = 1}^K \| w^{t - K + k} - w^t \|^2\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K^4 L^2 M^2 \sigma \gamma_t^2}{2}\\ & = - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \gamma_t^2 \frac{K^2 L M}{2} (1 + K^2 L M \sigma) \end{align*} $$

April 21, 2024 · 2 min

Intuition of Universal Approximation Theorem

The universal approximation theorem states that an infinitely wide single-layer neural network with a squashing activation function can uniformly approximate an arbitrary continuous function. There are also stronger statements covering approximation of Borel measurable functions, but continuous functions are enough in our case, and we may intuitively expect the space of continuous functions to approximate the space of Borel measurable functions in an appropriate probabilistic sense. We will not delve into the details; for more information, check the original paper (Multilayer Feedforward Networks are Universal Approximators). ...
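To make the intuition concrete, here is a sketch (my own illustration, not from the paper) that hand-builds a single hidden layer of steep sigmoids: each unit realizes an approximate step function, and a telescoping sum of steps tiles the target into a staircase whose uniform error shrinks as the number of units grows:

```python
import numpy as np

def sigmoid(t):
    # numerically stable logistic, written via tanh
    return 0.5 * (1.0 + np.tanh(0.5 * t))

# Target: any continuous function on [0, 1]; sin as a stand-in example.
f = lambda x: np.sin(2 * np.pi * x)

K, steep = 200, 500.0          # number of hidden units, step sharpness
knots = np.linspace(0, 1, K + 1)
x = np.linspace(0, 1, 2001)

# One hidden unit per knot realizes the staircase jump at that knot.
approx = np.full_like(x, f(knots[0]))
for k in range(1, K + 1):
    jump = f(knots[k]) - f(knots[k - 1])
    approx += jump * sigmoid(steep * (x - knots[k]))

print(np.max(np.abs(approx - f(x))))  # uniform error, small for large K
```

Doubling K (and steepness) roughly halves the staircase error, which is the finite-width shadow of the limiting statement in the theorem.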

April 4, 2024 · 8 min

Optimal Code Existence for Countably Infinite Sources

Huffman coding demonstrates both the existence of an optimal code and a concrete construction for sources with a finite alphabet. However, the construction fails when the alphabet is infinite. We will prove the existence of an optimal code for sources with a countably infinite alphabet. Notations Without loss of generality, we use only 0 and 1 to construct codewords, and the base of $\log$ is 2 by default. $X$ is the random variable from the source, whose probability distribution is $p_1 , p_2 , p_3 , \cdots$, and $l_1, l_2, l_3, \cdots$ denote the lengths of the corresponding codewords. Without loss of generality, assume $p_1 \geq p_2 \geq p_3 \geq \cdots$; by the rearrangement inequality, we may set $l_1 \leq l_2 \leq l_3 \leq \cdots$. The problem we are considering here can be modeled as an integer optimization problem: ...
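As a feasibility check for that optimization problem, the classical Shannon lengths $l_i = \lceil -\log p_i \rceil$ give one admissible point: they satisfy the Kraft inequality and come within one bit of the entropy. A sketch with a hypothetical dyadic source $p_i = 2^{-i}$, truncated for computation (the tail mass is negligible):

```python
from math import ceil, log2

# Hypothetical source: p_i = 2^{-i}, i = 1, 2, ..., truncated at N terms.
N = 60
p = [2.0 ** -i for i in range(1, N + 1)]

# Shannon lengths l_i = ceil(-log2 p_i) are one feasible point of the
# integer program: they satisfy the Kraft inequality sum 2^{-l_i} <= 1.
l = [ceil(-log2(pi)) for pi in p]
kraft = sum(2.0 ** -li for li in l)
avg_len = sum(pi * li for pi, li in zip(p, l))
entropy = -sum(pi * log2(pi) for pi in p)

assert kraft <= 1.0
assert entropy <= avg_len < entropy + 1  # within one bit of the entropy
print(kraft, avg_len, entropy)
```

For this dyadic source the Shannon lengths are exactly $l_i = i$, so the expected length equals the entropy; for general distributions the lengths are merely feasible, which is what the existence proof needs as a starting point.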

March 30, 2024 · 3 min

A Problem on `va_list` in C Language

What’s the output of the following code, and why?

#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("%#018llx\n", (char)0x80);
    printf("%#018llx\n", (unsigned char)0x80);
    return 0;
}

(You might see warnings about the inconsistency between the specified format and the given arguments; let’s ignore them.) The answer is

0x00000000ffffff80
0x0000000000000080

Questions We have two questions: Is it overloading that contributes to the different behaviors when arguments of different types are passed? Why is the first output 0x00000000ffffff80 instead of 0xffffffffffffff80? Answers The analysis is based on the printf implementation in x86 linux boot ...
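The first output is driven by sign extension: on platforms where char is signed, (char)0x80 wraps to −128, and C's default argument promotions widen it to a 32-bit int, whose two's-complement bit pattern lands in the low half of the 64-bit value that %llx reads. The arithmetic can be mimicked outside C (a sketch of the bit manipulation, not of printf itself):

```python
# (char)0x80 on a signed-char platform is -128; promotion to a 32-bit int
# yields the two's-complement pattern 0xffffff80, which %llx then prints
# with the upper 32 bits left as zero.
val = (-128) & 0xFFFFFFFF
print(f"{val:#018x}")
```

The unsigned char case promotes 0x80 to the int 128 with no sign extension, matching the second output.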

November 11, 2023 · 5 min

My Paper-Reading Workflow in 2023

Main The conventional approach to storing a file involves fitting it into a hierarchical structure, which necessitates a comprehensive overview of the corresponding field before the very first paper is read. Alternatively, you may place papers flat in an inbox folder and postpone the tedious task of reindexing and categorizing hundreds of them hierarchically, but then the overwhelming folder becomes your first obstacle to retrieving information. Either way, both methods impose an additional mental burden. Furthermore, the fact that papers can be organized in multiple ways compounds the disaster in a hierarchical system, where each paper sits in exactly one place. Therefore, an ideal note-taking structure for papers should be bottom-up, where papers are connected like a web and a single paper can be attached to distinct categories. ...

September 1, 2023 · 3 min