Shannon Entropy, Meaning, and Generative AI
Information theory, based on the work of Claude Shannon, reduced signals to their probabilistic essence; now generative AI reduces signals to their semantic essence.
At its core, artificial intelligence (AI) is based on algorithms that identify statistical correlations between the states of signals. Various forms of data, including text, images, sound, and video, are signals from an information-theoretic perspective. I use the term "signal" rather than "information" here because the terms "information" and "meaning" are often inseparable in colloquial usage.
Training an AI means exposing it to signals until it can adequately recognize patterns in signals it hasn't been exposed to, providing the basis for tasks such as classification, prediction of states, and even generation of new signals. The Internet was the umbilical cord that nourished AI, providing the flow of information that allowed it to adjust its internal framework and to reflect in its matrices the patterns it encountered (matrix is actually Latin for womb). Perhaps the Information Age culminated in the Internet, which merged signal storage (on servers) with signal transmission into a state of signal persistence. Now we are at the dawn of what Ray Kurzweil in 1990 called "The Age of Intelligent Machines".
But what is information? In his 1948 paper "A Mathematical Theory of Communication," the American mathematician and electrical engineer Claude Shannon introduced a measure of "information, choice, and uncertainty": information entropy, H(X), which is "independent of the means used to generate the information" and "invariant to the form," as Myron Tribus, former director of the MIT Center for Advanced Engineering Study, wrote in a brilliant 1971 article in Scientific American. Shannon did this by defining information entropy in terms of the probabilities of a signal's states: the more predictable a state, the less information it contains. A perfectly predictable state therefore contains no information. Conversely, an unpredictable or random state, such as the outcome of a fair coin flip, contains a lot of information.
If we know all the patterns of a system - all the rules of order - we can describe the system in the most efficient way, without any uncertainty or information entropy. Complete randomness, on the other hand, implies a high degree of information entropy; extensive communication is required to accurately transmit a random signal, since randomness lacks predictable patterns and therefore cannot be compressed without loss. Information entropy, often referred to as Shannon entropy, can thus be thought of as the deviation from a predictable pattern or order, and it also indicates how far a signal can be compressed. Most of the algorithms used to compress signals for efficient processing, storage, and transmission - on which the Internet, cellular networks, and various data formats rely - are based on Shannon's information theory.
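Formally, the entropy of a discrete distribution is H = -Σ p_i log2(p_i), measured in bits. A minimal Python sketch (not from the original article; the probabilities are purely illustrative) makes the coin-flip intuition above concrete:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over all states.
    States with p = 0 or p = 1 contribute nothing and are skipped."""
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

print(shannon_entropy([1.0, 0.0]))   # perfectly predictable state: 0 bits
print(shannon_entropy([0.5, 0.5]))   # fair coin flip: 1 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin: ~0.47 bits
```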
To get an idea of Shannon's concept, imagine a chessboard: It consists of 8 x 8 = 64 squares. Let's index the rows and columns accordingly, and call squares "even" if the sum of the indices is even, and "odd" if the sum of the indices is odd. We could describe a chessboard by saying that even squares are black and odd squares are white (Figure A). Such a description is already a form of data compression.
Let's call the probability that a square is black "p" and the probability that it is white "q". A "deterministic" chessboard can be described with p=1 and q=0 for even squares and p=0 and q=1 for odd squares. Since we know everything about this system, the Shannon entropy is 0 bits (Shannon defined the bit as the unit of information entropy). We can convert this deterministic chess pattern into a probabilistic form: If we set p=0.99 and q=0.01 (or p=0.9 and q=0.1) for even squares and p=0.01 and q=0.99 (or p=0.1 and q=0.9) for odd squares, we get more or less noisy chessboards (Figures B and C). For comparison, imagine a "chessboard" where it is completely random whether a square is black or white, i.e. p=q=0.5 for all squares (Figure D). We cannot find any patterns in this system of equiprobable states, which results in the maximum entropy for a board of 64 binary squares: 64 bits.
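A short Python sketch (again not from the article; it assumes that each square is noisy independently of the others and that Figures B and C correspond to p=0.99 and p=0.9) reproduces these entropy values:

```python
import math

def square_entropy(p):
    """Entropy in bits of one square that is black with probability p and white with 1 - p."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if 0 < q < 1)

# Total entropy of the 64-square board: each square contributes independently.
for label, p in [("deterministic board (Figure A)", 1.0),
                 ("slightly noisy board (Figure B)", 0.99),
                 ("noisier board (Figure C)", 0.9),
                 ("completely random board (Figure D)", 0.5)]:
    print(f"{label}: {64 * square_entropy(p):.1f} bits")
```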
Shannon entropy is often introduced with the following idea: it corresponds to the minimum number of binary (yes or no) questions that we must ask, on average, to accurately describe a system. With complete randomness, we have to ask 64 questions to get all the information about our random chessboard (no square is predictable). In fact, this idea works for the (more or less) noisy chessboards as well: We could start with the most informative question, such as "Is every even square black and every odd square white in, say, the top half of the board?" and continue until we know the state of every square. This allows us to accurately describe a chessboard with fewer than 64 questions on average, depending on the information entropy.
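The link between bits and yes/no questions can be made concrete with an optimal prefix code: each code bit answers one well-chosen binary question, and the average codeword length approaches the Shannon entropy. A hedged Python sketch using Huffman coding on a hypothetical four-symbol distribution (not from the article):

```python
import heapq
import math

def huffman_code(probs):
    """Build a Huffman prefix code; each bit of a codeword answers one yes/no question."""
    # Heap entries: (probability, tie-breaker, {symbol: codeword-so-far})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}   # hypothetical skewed distribution
code = huffman_code(probs)
avg_questions = sum(probs[s] * len(code[s]) for s in probs)
entropy = sum(-p * math.log2(p) for p in probs.values())
print(code)                                                   # e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(f"average questions per symbol: {avg_questions:.2f}")   # 1.75
print(f"Shannon entropy:              {entropy:.2f} bits")    # 1.75
```

With this skewed distribution, 1.75 questions per symbol suffice on average even though there are four possible states; with equiprobable states the average would rise to 2 questions (log2 of 4).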
For Shannon, the concept of information was not the same as the one used in everyday language. He was primarily interested in optimizing signal transmission through (noisy) channels and used a bottom-up approach to describe information within the "microscopic" states of a signal. In Shannon's theoretical framework, the meaning of the information did not matter: a completely meaningless signal can have a high information content. This is the main difference from the everyday concept of information, in which meaning and information are often interchangeable.
In their groundbreaking 2017 paper that ushered in the age of generative AI, "Attention is all you need," Vaswani et al. reported that "not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences." Five years later, transformer-based neural networks equipped with this "behavior related to [...] semantic structure of sentences" culminated in the release of ChatGPT.
The concept of entropy originally comes from thermodynamics, where it describes the degree of disorder or randomness of a system. Semantics and meaning ultimately result in patterns that limit entropy in the systems through which humans communicate (e.g. language, images, sound). Equipped with attention mechanisms, generative transformer-based neural networks take meaning into account and reduce semantic entropy by recognizing these underlying patterns. Without going into the details of how generative AI works, it is this reduction of semantic entropy that allows AI to understand our questions and answer them appropriately. Generative AI can be thought of as a top-down approach that processes the semantic connections between the states of signals - connections that can only be recognized from a macroscopic point of view. The resulting automated reduction of semantic entropy is a new way of dealing with information that extends Shannon's concept of information entropy: Shannon reduced signals to their probabilistic essence; now generative AI reduces signals to their semantic essence - or, as Ray Kurzweil wrote in his 1999 bestseller The Age of Spiritual Machines: "The purposeful destruction of information is the essence of intelligent work."
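One way to make this reduction of semantic entropy tangible is to ask a language model how surprised it is by a signal. The following sketch is only an illustration under stated assumptions: it uses the publicly available GPT-2 model through the Hugging Face transformers library (neither of which is mentioned in the article) and compares the average number of bits per token the model assigns to a meaningful sentence versus the same words scrambled:

```python
# Requires: pip install torch transformers
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def bits_per_token(text):
    """Average cross-entropy of the model's next-token predictions, converted from nats to bits."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean next-token cross-entropy in nats
    return loss.item() / math.log(2)

print(bits_per_token("The cat sat on the mat."))   # meaningful sentence: fewer bits per token
print(bits_per_token("mat the on sat cat The."))   # scrambled words: typically more bits per token
```

A lower bits-per-token value means the model has captured more of the signal's patterns - its meaning - and, in Shannon's terms, sees less entropy in it.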