Working Memory in Neural Networks

Classified by duration, humans have at least three kinds of memory: Working Memory (WM), Short-Term Memory (STM), and Long-Term Memory (LTM). WM can be seen as the internal state of the system, which varies across the entire process. STM can be considered the memory for milestones or key objects in a multi-stage task. Finally, LTM can be conceived as the neural network itself: persistent knowledge embedded in the architecture. ...
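As a minimal sketch of the analogy (illustrative only, not from the post): in a vanilla RNN, the hidden state is the internal state that varies across the entire process, while the fixed weights play the role of persistent knowledge embedded in the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed weights: the "LTM" analogue -- knowledge embedded in the architecture.
Wh = rng.normal(size=(4, 4)) * 0.5
Wx = rng.normal(size=(4, 3)) * 0.5

h = np.zeros(4)                    # hidden state: the "WM" analogue
for x in rng.normal(size=(6, 3)):  # a sequence of six inputs
    h = np.tanh(Wh @ h + Wx @ x)   # the internal state varies at every step
# Wh and Wx never change during the process; only h does.
```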

December 7, 2024 · 9 min

Reimagining Internal Family Systems Through the Lens of AI

IFS Concepts Internal Family Systems (IFS) is a model of psychotherapy proposed by Richard Schwartz in the 1980s. In this model, everybody is a combination of different parts, like the characters in the film Inside Out. The difference is that only emotions are anthropomorphized in the animation, while in IFS a part can also be a piece of thought or a mindset. For example, if you have social appearance anxiety, there is always a voice in your mind criticizing your appearance. In IFS we consider the source of this voice a part of your mind. It is somewhat like dramaturgical theory, with IFS focusing more on internal feelings and dramaturgical theory emphasizing more on social characters. ...

August 8, 2024 · 8 min

Notes on Sauer's Lemma

Introduction Every binary classifier is a function mapping its input, an element of an enumerable dataset, to 0 or 1. Equivalently, we can regard the classifier as a function $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $. We have a set of hypotheses $\mathcal{H}$ from which a function is chosen to maximize classification accuracy. This works perfectly if $\mathcal{H}$ contains all possible functions $ f : \mathbb{N} \rightarrow \{ 0, 1 \} $, which amounts to a universal approximator. However, when the expressiveness of our model is limited by computational cost or the number of parameters, it remains a problem to quantitatively measure its ability to approximate the ground truth. For example, if $\mathcal{H}$ consists only of functions that produce 1 on exactly one data point, namely, ...
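The truncated example can be made concrete with a short sketch. Assuming the class consists of hypotheses $h_j$ that output 1 only on the single point $j$ (the `labelings` helper and the enumeration bound below are illustrative), such a class induces far fewer labelings on a sample than the $2^n$ of a fully expressive class:

```python
# Hypothesis class from the excerpt: h_j(x) = 1 iff x == j, one hypothesis
# per j in N (the range bound and `labelings` helper are illustrative).
def labelings(sample, num_hypotheses=1000):
    """Distinct labelings the class induces on `sample`."""
    return {tuple(1 if x == j else 0 for x in sample)
            for j in range(num_hypotheses)}

sample = [3, 7, 42]
L = labelings(sample)
print(len(L))  # 4: one labeling per point hit, plus the all-zero labeling
```

On $n$ points this class realizes only $n + 1$ of the $2^n$ possible labelings, so it cannot shatter even two points (VC dimension 1), which is presumably the kind of limitation the post goes on to quantify via Sauer's lemma.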

July 26, 2024 · 6 min

A Succinct Proof of Decoupled Parallel Backpropagation Convergence Lemma

The meaning of the notations below complies with the original paper. $$ \begin{align*} \mathbb{E} [f (w^{t + 1})] - f (w^t) & \leqslant \nabla f (w^t)^{\top} \mathbb{E} [(w^{t + 1} - w^t)] + \frac{L}{2} \mathbb{E} [\| w^{t + 1} - w^t \|^2]\\ & = - \gamma_t \nabla f (w^t)^{\top} \mathbb{E} \left[ \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right] + \frac{L \gamma_t^2}{2} \mathbb{E} \left[ \left\| \sum^K_{k = 1} \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \right\|^2 \right]\\ & = - \gamma_t \| \nabla f (w^t) \|^2 - \gamma_t \nabla f (w^t)^{\top} \left( \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right) + \frac{K L \gamma_t^2}{2} \sum_{k = 1}^K \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i (t - K + k)} (w^{t - K + k}) \|^2]\\ & \leqslant - \gamma_t \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{\gamma_t}{2} \left\| \sum_{k = 1}^K \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f (w^t) \right\|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f_{\mathcal{G} (k)} (w^{t - K + k}) - \nabla f_{\mathcal{G} (k)} (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K \gamma_t}{2} \sum_{k = 1}^K \| \nabla f (w^{t - K + k}) - \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2}\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K L^2 \gamma_t}{2} \sum_{k = 1}^K \| w^{t - K + k} - w^t \|^2\\ & \leqslant - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \frac{K^2 L M \gamma_t^2}{2} + \frac{K^4 L^2 M^2 \sigma \gamma_t^2}{2}\\ & = - \frac{\gamma_t}{2} \| \nabla f (w^t) \|^2 + \gamma_t^2 \frac{K^2 L M}{2} (1 + K^2 L M \sigma) \end{align*} $$
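Read line by line, the chain above appears to rely on a handful of standard facts (a reading aid, with $M$ the gradient-moment bound and $\sigma$ the step-size constant from the original paper; the final bound additionally unrolls the update rule on $\| w^{t - K + k} - w^t \|^2$):

$$ \begin{align*} \text{L-smoothness:} \quad & f (y) \leqslant f (x) + \nabla f (x)^{\top} (y - x) + \frac{L}{2} \| y - x \|^2\\ \text{Young's inequality:} \quad & - a^{\top} b \leqslant \frac{1}{2} \| a \|^2 + \frac{1}{2} \| b \|^2\\ \text{Cauchy--Schwarz:} \quad & \Big\| \sum_{k = 1}^K v_k \Big\|^2 \leqslant K \sum_{k = 1}^K \| v_k \|^2\\ \text{bounded moments:} \quad & \mathbb{E} [\| \nabla f_{\mathcal{G} (k), x_i} (w) \|^2] \leqslant M\\ \text{Lipschitz gradients:} \quad & \| \nabla f (w) - \nabla f (w') \| \leqslant L \| w - w' \| \end{align*} $$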

April 21, 2024 · 2 min

Intuition of Universal Approximation Theorem

The universal approximation theorem states that an infinitely wide single-layer neural network with a squashing activation function can uniformly approximate an arbitrary continuous function. There are also stronger statements covering approximation of Borel measurable functions, but continuous functions suffice in our case. Intuitively, we may expect the space of continuous functions to approximate the space of Borel measurable functions almost surely in the probabilistic sense, but we will not delve into the details here. For more information, see the original paper (Multilayer Feedforward Networks are Universal Approximators). ...
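A minimal numerical sketch of the statement (illustrative only; random sigmoid features with least-squares output weights, not the construction used in the paper): widening the single hidden layer can only shrink the fitting error on a fixed continuous target.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)[:, None]   # grid on a compact set
target = np.sin(x).ravel()                 # a continuous target function

# One wide random hidden layer; narrower networks reuse its first columns,
# so the least-squares error is non-increasing as the width grows.
W = rng.normal(size=(1, 200)) * 3.0
b = rng.normal(size=200) * 3.0
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))     # sigmoid: a squashing function

def l2_error(width):
    """Fit only the output weights of a width-`width` network."""
    c, *_ = np.linalg.lstsq(H[:, :width], target, rcond=None)
    return float(np.linalg.norm(H[:, :width] @ c - target))

errs = [l2_error(w) for w in (2, 10, 50, 200)]
print(errs)  # the error shrinks as the layer widens
```

Only the output weights are trained here, which makes the monotone improvement obvious (nested column spaces); the theorem itself is about the expressiveness of the hidden layer, not about any training procedure.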

April 4, 2024 · 8 min

Brainstorm

Here are some immature ideas that come to me while taking a walk or having a meal. Feel free to draw inspiration from this post and write some papers; I would be glad to see studies on these topics. This post will be updated from time to time. The research field underestimates the importance of optimization. You cannot expect one to have a perfect memory. What we do every day is just adjust our cognition. That's what Mass Editing Memory in a Transformer does! This method remodels knowledge, namely the relations between objects. But it does not involve behavioral remodeling, for example the procedure for solving a tedious math problem. Information about relations between texts and images lives in cross-attention layers (e.g. Prompt-to-Prompt), while knowledge-related relations are stored in MLP layers (e.g. https://rome.baulab.info/). Are procedural relations stored in cross-attention layers? ...

March 16, 2023 · 5 min

Existence of Optimal Policy in Markov Decision Process

In this blog, we will prove the following theorem. Optimal Policy Existence Theorem: For any Markov Decision Process, (1) there exists an optimal policy $\pi_\star$ that is better than or equal to all other policies: $ \pi_\star \geq \pi , \forall \pi$; (2) all optimal policies achieve the optimal value function: $V_{\pi_\star}(s)=V_\star(s)$; (3) all optimal policies achieve the optimal action-value function: $Q_{\pi_\star}(s,a)=Q_\star(s,a)$. Definition To simplify the exposition, we first define some basic concepts. ...
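As a toy illustration of the theorem (a hypothetical 2-state, 2-action MDP, not from the post): value iteration converges to the optimal value function, and acting greedily with respect to it yields an optimal policy.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP:
# P[a][s, s'] is the transition probability, R[a][s] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.1, 0.9], [0.7, 0.3]]])  # action 1
R = np.array([[1.0, 0.0],                 # rewards under action 0
              [0.0, 2.0]])                # rewards under action 1
gamma = 0.9

# Value iteration: the Bellman optimality operator is a gamma-contraction,
# so V converges to the (unique) optimal value function V_star.
V = np.zeros(2)
for _ in range(500):
    V = np.max(R + gamma * (P @ V), axis=0)

# Any policy greedy w.r.t. V_star is optimal and achieves Q_star.
pi_star = np.argmax(R + gamma * (P @ V), axis=0)
```

The fixed point reached here satisfies the Bellman optimality equation, which is exactly the property the existence proof in the post establishes in general.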

September 12, 2022 · 8 min