This is a great comprehensive deep dive, thank you for sharing!
Interestingly, there was a blog post and video in 2019 by Peter Bloem with the same title, which I consider to be one of the very best quick intros to transformers. Rather than diving into all of the details right away, Peter focuses on the intuition and “peels the onion” gradually.
Other notable good intros to transformers are:
- The Illustrated Transformer by Jay Alammar
- The Annotated Transformer by Harvard NLP
Transformers are exciting as they seem to work on all types of modalities, including vision. It makes me wonder if the transformer module captures some essence of the minicolumn structure found all over the neocortex that Jeff Hawkins raves about, citing Vernon Mountcastle. Hawkins et al. talk about grid cells and location nowadays; maybe attention and context are the generalization of such a notion.
The hierarchical transformer variants are uncovering some possible optimizations that are similar to the ideas of Thousand Brains - https://arxiv.org/abs/2110.13711
The attention mechanisms, in conjunction with autoencoding, create a rough approximation of what grid cells accomplish, but transformers are still a feedforward architecture. Thanks to Moore's law, we can expand the scale of inputs to achieve human-like performance, but until someone untangles the structure and devises a way of including recurrence, transformers won't be able to perform all of the functions assumed by Hawkins.
There are interesting LSTM variations on transformers, but nothing public yet that really performs at the same level as the straight feedforward models. Combinatorial explosion is a bitch, and LSTMs explode the size and compute requirements. Hierarchical structures could constrain the requirements to something achievable.
With recurrence, you can begin to train models to perform things like discrete mathematics, as opposed to the relatively shallow semantic graphs in GPT-3-like models. The models right now don't have anything stateful that could be called memory, but with recurrence, model states will be dynamic encodings that can be processed over many cycles.
They would say no, they believe it is more like a graph probabilistic structure
It's interesting to see how human understanding differs when it comes to complex, yet clearly defined topics, like machine-learning/Transformers.
For comparison, after going through Peter Bloem's "Transformers from scratch", implementing the code, and tracing the actual flow of the mathematical quantities, my understanding is that:
- Transformers consist of 3 main parts: 1. Encoders/Decoders (I/O conversion), 2. Self-attention (Indexing), 3. Feed-forward trainable network (Memory).
- The feed-forward part is the simplest kind of neural net (a single layer applied at each position), often implemented as a Conv1d layer with kernel size 1, which amounts to a matrix multiply plus a bias and an activation.
- The most interesting part is the multi-head self-attention, which I understand as a randomly initialized multi-dimensional indexing system where different heads focus on different variations of the indexed token instance (a token is initially e.g. a word or part of a word) with respect to its containing sequence/context. Such an encoded token instance contains information about all other tokens of the input sequence (a.k.a. self-attention), and these representations vary based on how the given attention head was (randomly) initialized.
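The self-attention step above can be sketched in a few lines of NumPy. This is a minimal single-head version with made-up dimensions; the projection matrices are randomly initialized here, standing in for the learned weights the comment describes:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) -- one embedded input sequence.
    w_q, w_k, w_v: (d_model, d_k) projections, randomly
    initialized here (learned during training in practice).
    """
    q = x @ w_q                               # queries
    k = x @ w_k                               # keys
    v = x @ w_v                               # values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    # Row-wise softmax: each token attends to every token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row mixes information from the whole sequence.
    return weights @ v                        # (seq_len, d_k)

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

A "multi-head" layer just runs several of these in parallel with independently initialized projections and concatenates the results, which is why each head ends up indexing a different variation of the same token.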
The part that really hits you is when you understand that, to a Transformer, a token is unique not only due to its content/identity (and all the other tokens in the given context/sentence) but also due to its position in the context. For example, the word "the" at the first position is a completely different word from the word "the" at the second position, even if the rest of the context is the same. (This is obviously a massive waste of space if you think about it, but at the moment it is the only/best way of doing it, because it moves a massive amount of processing from inference time to training time -- which is what our current von Neumann hardware architectures require.)
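One common way position gets baked into the token (the sinusoidal scheme from the original 2017 paper) is to add a position-dependent vector to the embedding, so the same word at two positions becomes two different inputs. A minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings ('Attention Is All You Need')."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model)[None, :]          # (1, d_model)
    # Paired sin/cos channels at geometrically spaced frequencies.
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(6, 8)
# The word "the" at position 0 and at position 1 gets different
# vectors added to the same token embedding, so the model really
# does see two distinct inputs.
assert not np.allclose(pe[0], pe[1])
```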
Your last point is true only with positional encodings, though; attention itself is a permutation-equivariant function.
A little off-topic discussion: I sometimes think the use of English is a bit too siloed.
This is talking about transformers of a mathematical nature. The same term can describe an electrical component. The word itself "trans-form-er" seems to mean "A thing that changes (the shape of?) something" - this is a very vague semantic, maybe appropriate for an abstract mathematical notion, but silly for any specific thing.
It reminds me of all the vague nouns used to name OOP objects: FooManager, FooHandler, FooController, BarContext, BarInfo, BazAdapter, BazConvertor, BazTransformer; etc etc etc.
Yeah, based on the headline I actually thought the article would be about electric power conversion for dummies. I guess both topics are on-topic for HN.
Well, at least you didn’t think it was about robots who could turn into cars. I spend way too much time watching videos with my son these days.
Same, especially since the domain gave me electrical engineering vibes.
As a Decepticon, I couldn’t agree more
I remember designing plenty of "BazAdapterFactoryManager" software in my early days of software engineering.
Sometimes I still have to maintain that software, and I remember quickly what regret feels like.
Maybe we need stronger conventions of meaning?
If this new AI concept dominates, we're gonna have to resort to calling transformers "coupled inductors".
I was like "oh cool, winding our own transformers!"
Yeah, that is what I thought too. I wish the authors of the 2017 Google paper had been considerate enough not to take a term from such a closely related field; it causes unnecessary confusion.
More than meets the eye!