Today's generative AI is a technology that blossomed from a major breakthrough: the invention of the Transformer.
The defining feature of the Transformer is the Attention Mechanism, as the title of the paper that introduced it states succinctly: "Attention Is All You Need."
The background is that AI researchers at the time were experimenting, through much trial and error, with ways to make AI handle natural language as skillfully as humans do, naming the methods that worked and publishing papers about them.
Many researchers believed that by combining these effective mechanisms in various ways, they could gradually build AI capable of handling natural language as humans do. Their focus was on discovering new mechanisms that would work well in combination, and on finding the optimal combinations among them.
The Transformer overturned this conventional wisdom. Its message, evident in the paper's title, was that there was no need to combine assorted mechanisms: the Attention Mechanism alone was enough.
While the Transformer itself certainly incorporates various mechanisms, there is no doubt that the Attention Mechanism was particularly groundbreaking and distinctive among them.
Overview of the Attention Mechanism
The Attention Mechanism is a system that lets the AI learn which words, among the many in the preceding text, it should pay attention to when processing a given word.
This enables the AI to pin down what a word refers to: demonstratives such as "this," "that," or "the aforementioned" that point back to a word in a previous sentence, or positional references such as "the opening sentence," "the second example listed," or "the preceding paragraph."
Furthermore, it can correctly link modifiers to the words they modify even when they are far apart in a sentence, and in long texts it can interpret each word without losing track of the context it refers to, instead of getting lost among the other sentences.
This is the utility of "attention."
Put the other way around, this means that when the current word is being interpreted, unnecessary words are masked out and excluded from the interpretation.
By retaining only the words essential to interpreting a given word and discarding the irrelevant ones, the set of words under consideration stays small no matter how long the text, so the interpretation never becomes diluted.
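To make this concrete, here is a minimal sketch of scaled dot-product attention, the core computation behind the mechanism, written in Python with NumPy. It is an illustrative simplification (a single head, no learned projection matrices), not the full Transformer formulation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal single-head attention over one sequence.

    Q, K, V: (seq_len, d) arrays of query, key, and value vectors.
    mask:    optional (seq_len, seq_len) boolean array; True marks
             word pairs that should be ignored.
    """
    d = Q.shape[-1]
    # How relevant is each other word (key) to the word currently
    # being processed (query)? Scaled for numerical stability.
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        # Masked pairs get -inf, so softmax gives them ~zero weight:
        # irrelevant words are removed from the interpretation.
        scores = np.where(mask, -np.inf, scores)
    # Softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's new representation mixes the values of the few
    # words it actually attends to.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                   # 5 words, 8-dim vectors
out = scaled_dot_product_attention(x, x, x)   # self-attention
```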
Virtual Intelligence
Now, shifting gears a bit, I've been contemplating the concept of virtual intelligence.
Currently, when generative AI is used in business, consolidating all of a company's information into a single knowledge base and handing it to the AI tends to overwhelm it: the sheer volume of knowledge means the AI can no longer process it properly.
For this reason, it is more effective to separate knowledge by task, preparing AI chats for each task or creating AI tools specialized for specific operations.
Consequently, complex tasks require combining these AI chats or AI tools, each with its own segregated knowledge.
While this reflects a current limitation of generative AI, the principle should hold even for future systems: focusing only on the knowledge a specific task requires leads to higher accuracy.
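As a rough sketch of this current practice, knowledge can be segregated by task and only the relevant portion placed in the prompt. Everything below (the task names, the knowledge snippets, the call_llm helper) is hypothetical:

```python
# Hypothetical sketch: knowledge segregated by task rather than one
# monolithic company-wide knowledge base.
KNOWLEDGE_BASES = {
    "invoicing": ["billing rules ...", "tax rates ..."],
    "support": ["product FAQ ...", "escalation policy ..."],
}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    return f"[model response to a {len(prompt)}-character prompt]"

def answer(task: str, question: str) -> str:
    # Only this task's knowledge goes into the prompt, keeping the
    # context small and focused instead of overwhelming the model.
    knowledge = "\n".join(KNOWLEDGE_BASES[task])
    prompt = f"Use only this knowledge:\n{knowledge}\n\nQ: {question}"
    return call_llm(prompt)
```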
The difference, I believe, is that future generative AI will be able to internally separate out and use the necessary knowledge for each situation, without humans having to segment that knowledge in advance.
This capability is what I call virtual intelligence. Like a virtual machine that lets a single computer run multiple different operating systems, it lets a single intelligence host multiple virtual intelligences with different specializations.
Even current generative AI can already simulate discussions among multiple people or generate stories featuring multiple characters. Therefore, virtual intelligence is not a special capability but rather an extension of current generative AI.
Micro Virtual Intelligence
The mechanism of virtual intelligence, which narrows down the necessary knowledge according to the task, performs something similar to the Attention Mechanism.
In other words, it is analogous to the Attention Mechanism in that it focuses on and processes only the relevant knowledge based on the task currently being executed.
Conversely, the Attention Mechanism can be seen as realizing something akin to virtual intelligence. The difference is scale: the virtual intelligence I envision selects relevant items from a collection of knowledge, while the Attention Mechanism selects from a collection of words.
For this reason, the Attention Mechanism can be called Micro Virtual Intelligence.
Explicit Attention Mechanism
If we view the Attention Mechanism as micro virtual intelligence, then, conversely, the virtual intelligence I previously mentioned can be achieved by constructing a macro attention mechanism.
And this macro attention mechanism need not be built into the internal structure of a large language model, nor does it require neural-network training.
It can simply be an explicit statement written in natural language, such as: "When executing Task A, refer to Knowledge B and Knowledge C."
This clarifies the knowledge required for Task A. This statement itself is a type of knowledge.
This could be called an Explicit Attention Mechanism. This statement can be regarded as Attention Knowledge, which explicitly articulates the knowledge that should be focused on when performing Task A.
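As a sketch of how this might look in practice, such Attention Knowledge could be kept as simple task-to-knowledge statements that are consulted before a task runs. The task and knowledge names are placeholders, not a real system:

```python
# Attention Knowledge: explicit, human-readable statements of which
# knowledge each task should focus on. Names are illustrative.
ATTENTION_KNOWLEDGE = {
    "Task A": ["Knowledge B", "Knowledge C"],
}

def knowledge_for(task: str) -> list[str]:
    # The "macro attention" step: select only the knowledge that the
    # statement marks as relevant, and ignore everything else.
    return ATTENTION_KNOWLEDGE.get(task, [])
```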
Furthermore, this Attention Knowledge can be generated or updated by generative AI.
If a task fails for lack of knowledge, the AI can reflect on the failure and update the Attention Knowledge so that the missing knowledge is referenced for that task in the future.
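Continuing the sketch above, such an update step might look like the following, where diagnose_missing_knowledge stands in for a generative-AI reflection step and is entirely hypothetical:

```python
def diagnose_missing_knowledge(task: str, failure_log: str) -> list[str]:
    """Stand-in for a generative-AI reflection step that names the
    knowledge whose absence caused the failure."""
    return ["Knowledge D"]  # placeholder result

def update_on_failure(task: str, failure_log: str) -> None:
    # Add whatever the reflection step identified as missing to this
    # task's Attention Knowledge, so future runs will reference it.
    refs = ATTENTION_KNOWLEDGE.setdefault(task, [])
    for item in diagnose_missing_knowledge(task, failure_log):
        if item not in refs:
            refs.append(item)
```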
Conclusion
The Attention Mechanism has dramatically advanced the capabilities of generative AI.
It was not merely a mechanism that happened to work well; as we have seen, dynamically narrowing down the information to be referenced in each situation appears to be the very essence of advanced intelligence.
And, as virtual intelligence and explicit Attention Knowledge suggest, this same attention principle is a key to recursively enhancing intelligence across many layers.