Peer-reviewed article

The Elements of Intelligence

2023; The MIT Press; Volume: 29; Issue: 3; Language: English

10.1162/artl_a_00410

ISSN

1530-9185

Authors

Christoph Adami

Topic(s)

Cognitive Science and Mapping

Abstract

Can machines ever be sentient? Could they perceive and feel things, be conscious of their surroundings? What are the prospects of achieving sentience in a machine? What are the dangers associated with such an endeavor, and is it even ethical to embark on such a path to begin with? In the series of articles of this column, I discuss one possible path toward "general intelligence" in machines: to use the process of Darwinian evolution to produce artificial brains that can be grafted onto mobile robotic platforms, with the goal of achieving fully embodied sentient machines.

After reviewing the history of Artificial Intelligence research (Adami, 2021) and discussing the components, topology, and optimization methods used in artificial neural network research (Adami, 2022), we now take a step back to ask ourselves, What is intelligence? In our quest to evolve an intelligent system, this is not an idle question. In fact, asking this question will help us focus on essential features of what we call intelligence, rather than being distracted by incidental attributes. Our answer will be guided by the principle that intelligence is an evolutionary response to uncertain environments: that the primary purpose of intelligence is to increase the organism's fitness.

Just as it is unlikely that there will ever be a unique and universal definition of intelligence, it is also unlikely that there will be widespread agreement about what the processes are that contribute to intelligence: the elements of intelligence. The five elements that I will discuss here are rooted in the idea that intelligence is a (biological or computational) trait that enables its bearer to reduce the uncertainty about the world in which it lives (both in time and space) and harness the information it has gained to succeed against its competitors, cooperate with its supporters, and extract the resources it needs from its environment without coming to harm. Recognizing who is friend and who is foe (and using information to defeat the foe and support the friend) ultimately leads to a greater number of offspring.

To leverage information in support of organismal fitness, the organism needs to perceive the environment; extract the salient features (those that matter to the organism and can be perceived by its sensory system); make predictions and plans based on the sensed world as well as on what was learned from experience; and, finally, act according to those predictions. Such a view of intelligence is very much aligned with the "knowledge-level systems" view of the late 20th century (Anderson, 1983; Newell, 1990), except that those attempts to formulate a "unified theory of cognition" made no attempt to quantify said knowledge in terms of information. An information-theoretic view of intelligence and cognition has the advantage that it can quantify the relation between the "symbols" manipulated by the knowledge system and the things in the physical world that they represent. This is important because, historically, one of the most common criticisms of attempts to formalize (and ultimately engineer) thinking systems conjured up an apparent dichotomy between the "zeros and ones" of computer systems, which are devoid of intrinsic meaning ("strings by themselves can't have any meaning"; Searle, 1984, p. 31), and the fact that "thoughts are about things."
Information theory quantifies precisely that link, both in computers and in people. Whereas most theories of cognition posit that sensing and acting are integral elements of intelligence because they are clearly part of the "sensory–action loop" (Bongard & Pfeifer, 2001; Clark, 2016; Newell, 1990), here I take the point of view that the sensors and motors themselves are "given" (even though cognition does affect sensing and acting), and I discuss only the elements of intelligence that take place within the neurons of the brain, excluding sensors and motors (often called "peripheral neurons"; McCulloch & Pitts, 1943). We will see that the elements of intelligence that I will discuss—categorization, memory, prediction, learning, and representation—are all tied explicitly to how information is acquired, shaped, stored, and manipulated.

To make sense of the world that we perceive, it is imperative that we can tell one thing from another. How do we do that? How is it that a visual scene (say) evokes in the brain a set of objects and their relationships with each other, rather than a jumbled mishmash of colors and shapes? After all, the shapes and colors we perceive are not discrete but rather form a continuum. For us to be able to differentiate between things, we first need to be able to categorize.

Our ability to place objects in the world (and behaviors and ways in which objects relate to behaviors) into categories is a crucial skill that develops early in infancy (Quinn & Eimas, 1997). According to psychologists (see, e.g., Karmiloff-Smith, 1992), categories (collections or classes of objects and events that exist in the world) are formed via a perceptual analysis that filters the raw data, leaving behind an abstract representation in the form of image schemas. We all carry such image schemas with us. If I were to ask you to imagine a chair, for example, you could do that very easily, and even though you might imagine a straight-backed chair with four legs, you would not regard a three-legged stool as something completely outside of this category. In fact, we are able to subsume thousands of different shapes under this one category "chair." I propose that this ability to form categories is a central element of intelligence: all others that we will discuss build on this one. In particular, building categories allows us to quantify how much there is to know, using the information-theoretic concept of Shannon entropy (Shannon, 1948), a measure of uncertainty.

Claude Shannon, the creator of information theory, called his measure entropy because a very similar quantity had been introduced in statistical physics much earlier. For the purpose of understanding the concept within the context of cognitive science (and to see its relation with the concept of "information"), I will use the word uncertainty instead of entropy. Shannon defined his uncertainty concept both for continuous ("blurry") quantities and for discrete (or "sharp") ones.

Let us first write down Shannon's uncertainty function in the discrete case. To do this, we have to introduce the concept of a variable X that can take on n discrete states x_i, i = 1 … n. For the purpose of describing categorization, we might then ask, How do we associate objects in the world that have continuous shapes and colors and features with only one of the n categories defined by variable X?
This process (called coarse-graining in the literature; see, e.g., Feynman, 1974) is without a doubt a complex one, involving (as I will describe) a shift from perceptual characters that are described by continuous values to conceptual ones described by discrete values. For discrete categories, Shannon's (1948) uncertainty function is given by

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i),    (1)

where p(x_i) is the likelihood of encountering an object within category x_i in world X. In general, the number of categories (as well as the "distance" between different categories) depends on how useful the differentiation is. For example, in some situations, it might be relevant to make a distinction between two categories (say, "chair" and "stool") that is not necessary in others. In other words, the brain tends to operate with just the categories that are necessary to best understand (and predict) the world, given the particular circumstances.

How do categories emerge? This is a difficult question to answer, because although they clearly emerge over time via a process of use, feedback, and learning, those processes themselves are somewhat vague. Furthermore, some categories are clearly innate: The fear of the color red in certain birds (Pryke, 2009) is one such example. In a very real sense, categories evolve. Here we will think of the process of categorization as creating a certain number of image schemas that represent the different categories. Generally speaking, we can say that categories emerge so that there is a balance between a large enough number of different images to be able to describe the range of salient differences and a small enough number that manipulating these images in one's head (or wherever they are stored) is not too cumbersome. This emergence of categories is often described by a shift from perceptual representations (images that look very much like the object in question) to conceptual representations that have lost most of the similarities to the actual objects. When representations become so conceptual that there is no resemblance to the original perceptual artifact, such representations are often called symbolic. We should note that even perceptual representations already involve a certain kind of "filtering," because the sensory system can perceive only a finite range of values. So, for example, whereas the ultraviolet characteristics of flower petals do not enter our perceptual representations of them, they without a doubt do so in hummingbirds (and, for all we know, even in mice; see Cronin & Bok, 2016).

Perhaps a typical example of the evolution of a representation from perceptual to conceptual to symbolic is the evolution of the Sumerian cuneiform script. The Sumerian script is the earliest known writing system and is believed to have emerged around 8000 BC (Kramer, 1981). Originally, the "tokens" the Sumerians used were pictograms representing the objects to which they directly refer. Over time, the pictograms became stylized and morphed into symbols, adapting to the method people used to record the symbols, namely, via strokes made by pressing a stylus into clay. A typical sequence that changes the pictogram for "head" into a glyph representing the concept is shown in Figure 1.
Note that during the evolution of glyphs, the total number of symbols changes also, usually by eliminating symbols that can be rendered via a combination of others (Kramer, 1981).

Although learning about the objects in the world and categorizing them is a necessary precursor to intelligent behavior, it does not specify what action to take to reduce uncertainty. We can say quite generally that what reduces uncertainty is information. Indeed, eons of evolution have stored massive amounts of information about our world in our genome (a very rough estimate suggests about half a billion bits of information). However, to survive and thrive in a world that changes over a lifetime, we need to store information in a different medium: We need memory.

Shannon's measure for information can be seen as a difference between uncertainties: the uncertainty we have if nothing is known minus our current uncertainty, that is, the remaining uncertainty given all the knowledge we have. So, if we consider again our world X, in which we expect to find objects in categories x_1 to x_n with probabilities p(x_i), then the highest uncertainty would occur if all categories were equally likely in this world: p(x_i) = 1/n. In that case, this maximal uncertainty would be

H_{max} = \log n.    (2)

If we know instead that some categories occur with much greater likelihood than others (while some others might even have p(x_i) = 0), then this knowledge is quantified by the information

I = H_{max} - H(X) = \log(n) + \sum_{i=1}^{n} p(x_i) \log p(x_i).    (3)

One of the great advantages of information written as a difference of uncertainties is that when discretizing Shannon's continuous-valued differential entropy to the discrete version, a "renormalization" constant that is related to the level of discretization cancels from the expression of information, telling us that, in the end, only the measurable differences matter.

Storing information is crucial to making "informed" decisions. In our brains, information acquired in the past is stored in the connections between neurons and can be retrieved and integrated with the information streaming through our senses. This capacity to remember, and to evaluate the current sensory stream in the context of past experience, is crucial for intelligent behavior. Memory is the mandatory ingredient in two other elements of intelligence I discuss later: learning and representation.

Memory allows us to not make the same mistake twice, and it makes it possible for us to form models of the world within our brains. But memory, specifically the storage of information, is not something that is automatic. We are so used to information storage in our current world that we often take it for granted. But from the point of view of (classical) physics, information is inherently fragile: The second law of thermodynamics is unforgiving unless specific measures are in place to prevent the deterioration of information.

The means by which we store information has changed tremendously over time. In biology, the need to store information through means other than "filing it away" in the organism's DNA emerged once the world started changing on a timescale faster than an organism's lifetime. Before the Cambrian explosion, the animal world (mostly concentrated on the seafloor) was fairly predictable: The Ediacaran fauna were mostly sessile, and those that were not roamed the seafloor in idiosyncratic patterns that required no memory (Carbone & Narbonne, 2014). However, once the world became more complex, it was also changing more quickly.
To survive in such a world, it is necessary to react promptly to those changes, and it is likely that the spiking neuron evolved precisely for this reason, by coupling sensory information directly to motor activation (Jékely, 2011). But in a world that changes fast, it is also necessary to learn from mistakes. To do this, the brain has to be able to recall prior sequences of events: It needs memory.

The simplest form of memory is the preservation (and recall) of the fleeting state of the environment (this is also sometimes called working memory [see Baddeley, 1986], but others simply call it sensory short-term memory [Carruthers, 2014]). For firing neurons, keeping a signal in memory is not a trivial task, because after firing, the neuron returns to its quiescent state, thus erasing the signal. One way to preserve the signal is to stimulate the firing of another neuron, but how can that neuron's state be preserved? One solution is to arrange for the neuron to stimulate its own state by firing, and in evolutionary experiments with Markov brains, this is precisely what emerged (Edlund et al., 2011).

More complex processing is required when an organism needs to recall sequences of events (often called episodic memory; Tulving, 2002). The logic of such memory can be understood using the following simple model. Suppose a brain needs to remember a particular temporal sequence of bits. In the simplest case, this is a sequence of two bits; that is, the brain needs to remember (and then recognize) one out of four possible temporal patterns. This can be accomplished by a set of three neurons: a sensory neuron, an actuator neuron, and an intermediary ("hidden") neuron (see Figure 2). In this simple model, the hidden (intermediary) neuron is necessary to keep time, and a particular logic can be constructed (it will also readily evolve) so that the actuator neuron (the motor unit M) fires if (and only if) a particular temporal sequence is experienced within the sensor S. This logic (shown in Figure 2b) connects the signal S and the state of the hidden unit H at the previous time point to the state of the motor M and the hidden unit at the subsequent time point. This particular logic table implements a simple dynamic: If the input state of S is 0 while H is zero, H remains zero along with M, signaling "no recognition." Because the sequence to be recognized starts with a 1, so far so good. Indeed, the hidden neuron's state remains quiescent until a 1 is sensed. If this occurs, the hidden neuron begins to keep time as the pattern 10 makes the hidden neuron fire, while the motor neuron remains quiescent. If the following input is a 1 instead, the table forces a return to the quiescent state 00, which we can interpret as "try again": The sequence 10 was not detected, and the logic is re-prepared in the initial state to wait for a 1. However, if a 0 did follow that 1 within S, this logic dictates the output pattern 10 within the pair MH: The motor neuron fires to indicate successful recognition of the signal, while the timekeeper is reset to zero to await another sequence.

What I have described here is an extremely simple model of time series recognition (it is arguably the simplest), but it can easily be scaled up to handle longer time series. For example, in Figure 3a, we see the logic necessary to recognize a four-bit sequence, which needs two hidden neurons to keep time.
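As a concrete illustration of the two-bit recognizer just described (before scaling up to the four-bit case of Figure 3a), the update table of Figure 2b can be transcribed into a few lines of code. This is only a sketch of the logic spelled out above; the state encoding and the example input stream are my own choices, not material from the article:

```python
# Update table of the two-bit recognizer described above (cf. Figure 2b):
# (sensor S, hidden H) at time t  ->  (motor M, hidden H) at time t+1.
UPDATE = {
    (0, 0): (0, 0),  # nothing seen yet: stay quiescent, no recognition
    (1, 0): (0, 1),  # a 1 was sensed: the hidden neuron starts keeping time
    (1, 1): (0, 0),  # 1 followed by 1: "try again", return to the quiescent state
    (0, 1): (1, 0),  # 1 followed by 0: the motor fires, the timekeeper resets
}

def recognize(bits):
    """Return the motor neuron's output at each time step for a stream of sensor bits."""
    hidden, motor_trace = 0, []
    for s in bits:
        motor, hidden = UPDATE[(s, hidden)]
        motor_trace.append(motor)
    return motor_trace

# For this stream, the motor fires right after each completed "1, 0" pattern:
print(recognize([1, 0, 0, 1, 0]))  # -> [0, 1, 0, 0, 1]
```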
The 3-in, 3-out logic gate shown in Figure 3a will recognize the sequence 0001: the binary version of the famous "fate" motif of Beethoven's Fifth Symphony (if you identify the 0 with G and the 1 with B♭, while ignoring the rhythm of the motif). Logic that recognizes specific time series evolves readily in Markov brains (Hintze et al., 2017). Moreover, a simple reinforcement learning algorithm can quickly change this logic, as I discuss later.

Is logic of the sort described here used in actual brains to recognize time series? This is difficult to answer because it is very hard to reconstruct the logic of a set of neurons from the connections and recordings alone. However, given that the average motor neuron in the spinal cord, for example, receives inputs from thousands of other neurons that synapse on the cell body (see Alberts et al., 2002, chapter 11), it is conceivable that the positioning and strength of the synapses on the cell body (which can change with experience) implement precisely such logic.

That an accurate prediction of the future is going to be advantageous for any living organism is obvious: There is value in information, in particular in changing environments (Donaldson-Matasci et al., 2010; Rivoire & Leibler, 2011). The selective pressure to predict the future is intense. For this reason, the brain is often described as a "prediction machine" (Bubic et al., 2010; Clark, 2016; Friston & Kiebel, 2009; Hawkins, 2021; Hawkins & Blakeslee, 2004). But it is not only prediction of the future that is important. The information stored in genes is also used to predict the state of an organism's environment (for example, to ensure that the right genes are expressed at the right time). Predicting the future state of the environment, however, has several other advantages; for example, it becomes possible to plan ahead. One of the simplest prediction algorithms is not even based on neurons. The chemotaxis pathway that allows bacteria to swim "up" a resource concentration gradient can be viewed as a prediction algorithm that infers the location of the source of the resource from local concentration fluctuations.

How do we know that the brain makes predictions? And what are the predictions based on? Rivoire and Leibler (2011) discussed the power of information for predicting changing environments by studying the dynamics of adapting populations (not specifically brains), but their analysis is quite general. Even though they make a number of simplifying assumptions (to make the problem mathematically tractable), the formalism is powerful enough to reveal a fundamental relationship between information (used for prediction) and fitness.

Rivoire and Leibler's model describes agents that make decisions within a changing environment, using the information they inherited (stored in the genome) as well as information acquired from the environment. The model implies that if an agent optimally uses the information it has at its disposal, then it maximizes its fitness (measured in terms of the growth rate of the population).

Imagine an agent described by a random variable Z that can take on a finite number of states z. The agent can change states over time, so we have to introduce the variable Z_t, and the agent's trajectory over time is described by Z_1 Z_2 ⋯ Z_{t-1} Z_t. At the same time, we define an environment variable E_t that can take on states e_t. This environment changes over time in an uncorrelated manner, that is, the probability that E_t takes on state e_t is P(E_t = e_t) = p(e_t).
We now define the fitness function f(z_t, e_t) that represents the expected number of offspring generated by the agent when it is in state z_t while the environment is in state e_t, and we further assume that if the environment is in state e_t, then there is a particular state z_t for which f(z_t, e_t) = f(z_t) > 0, while the fitness is zero otherwise. In other words, to be fit, the agent's state has to "track" the environment. Here we assume that the only way the agent can do this is by sensing the environment. For this, the agent is equipped with a sensor (described by random variable X), and we also assume for simplicity that this sensor is accurate. Consulting this sensor that accurately displays the state of the environment, the agent can change its own state accordingly so that its state always tracks the environment. Rivoire and Leibler showed that if an agent cannot sense the environment (or is incapable of interpreting the sensed value and changing its state accordingly), then the growth rate Γ_0 of a population of such agents will be

\Gamma_0 = \langle \log f \rangle - H(E),    (4)

where ⟨log f⟩ is the logarithm of the fitness (the logarithm of fitness represents the individual's growth rate) averaged over the possible states of the environment (the probability distribution p(e) is assumed to be the stationary distribution, that is, the one obtained from p(e_t) in the long time limit),

\langle \log f \rangle = \sum_{e} p(e) \log f(e).    (5)

H(E), in turn, is the uncertainty that the agent has about the environment, given in terms of the Shannon entropy

H(E) = -\sum_{e} p(e) \log p(e).    (6)

The interpretation of Equation 4 is straightforward: There is a cost to being blind to the environment's changes, and the cost in growth rate is exactly equal to the Shannon entropy of the environment.

If an agent senses the environment with a perfect sensor X and follows an optimal strategy (a policy that will change the agent's state according to the sensed value), then the growth rate Γ_X will instead be

\Gamma_X = \langle \log f \rangle - H(E|X),    (7)

where H(E|X) is the conditional entropy of the environment given the sensed value, that is, the remaining uncertainty after taking the sensor's state into account. We can write this using the conditional probability p(e|x), which quantifies how likely it is that we will encounter environment e given that we sensed x, as

H(E|X) = -\sum_{e,x} p(x)\, p(e|x) \log p(e|x).    (8)

The value of this information for the population (the gain in predictability) is the difference between Equations 7 and 4:

\Gamma_X - \Gamma_0 = H(E) - H(E|X) = I(E;X);    (9)

that is, this value is given precisely by the Shannon information I(E;X) gained by the agent. This information, as I mentioned earlier, is the difference between an unconditional entropy H(E) (what there is to know) and a conditional entropy H(E|X) (what remains to be known once we know X). It quantifies how well an agent can predict the state of E (the environment) armed with the state of X (the sensor).

The model (simple as it is) brings the value of prediction into sharp focus: Information can be turned into fitness using an optimal strategy. Evolution, in turn, strives to find that optimal strategy.
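To see Equations 4 through 9 at work numerically, here is a small sketch (my own illustration, not material from the article): a two-state environment with invented probabilities, invented fitness values, and an invented noisy sensor, using base-2 logarithms. The script evaluates the growth rate of a blind population and of a sensing population and confirms that their difference equals the mutual information I(E;X):

```python
import math

# Toy two-state environment; the probabilities and fitness values are invented.
p_e = {"wet": 0.75, "dry": 0.25}          # p(e), Equation 6
f = {"wet": 4.0, "dry": 2.0}              # fitness f(e) of the matching agent state

# Sensor X: p(x | e). A slightly noisy sensor (also invented); a perfect sensor
# would put all of the probability on x == e.
p_x_given_e = {"wet": {"wet": 0.9, "dry": 0.1},
               "dry": {"wet": 0.2, "dry": 0.8}}

log2 = lambda z: math.log(z, 2)

# <log f>: log-fitness averaged over environment states (Equation 5).
avg_log_f = sum(p_e[e] * log2(f[e]) for e in p_e)

# H(E): the uncertainty about the environment (Equation 6).
H_E = -sum(p_e[e] * log2(p_e[e]) for e in p_e)

# Joint and marginal distributions needed for H(E|X) (Equation 8).
p_ex = {(e, x): p_e[e] * p_x_given_e[e][x] for e in p_e for x in p_e}
p_x = {x: sum(p_ex[(e, x)] for e in p_e) for x in p_e}
H_E_given_X = -sum(p_ex[(e, x)] * log2(p_ex[(e, x)] / p_x[x])
                   for (e, x) in p_ex if p_ex[(e, x)] > 0)

Gamma_0 = avg_log_f - H_E            # growth rate of a blind population (Equation 4)
Gamma_X = avg_log_f - H_E_given_X    # growth rate with the sensor (Equation 7)
I_EX = H_E - H_E_given_X             # Shannon information I(E;X) (Equation 9)

print(f"Gamma_0 = {Gamma_0:.3f}, Gamma_X = {Gamma_X:.3f}")
print(f"Gamma_X - Gamma_0 = {Gamma_X - Gamma_0:.3f} = I(E;X) = {I_EX:.3f}")
```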
Multiple experiments evolving artificial brains of agents navigating changing environments (C G et al., 2018; Edlund et al., 2011; Fischer et al., 2020; Hintze et al., 2015; Iliopoulos et al., 2010; Kvam & Hintze, 2018; Marstaller et al., 2013; Olson et al., 2016; Tehrani-Saleh & Adami, 2021; Tehrani-Saleh et al., 2016) have supported that conclusion.

Although making good predictions will, on average, increase an organism's chances for survival, it is also important to have a strategy to deal with situations in which a prediction turned out to be wrong. In general, the algorithm that allows us to change our actions based on environmental feedback is called reinforcement learning (Sutton & Barto, 1998). Clearly, learning is not an intrinsic capacity of an information-processing and -predicting system: This capacity needs to evolve. The literature on learning is vast and spans multiple different disciplines (see, e.g., Gross, 2020). Learning is an important concept in computer science, cognitive science, psychology, animal behavior, and more. Here I will focus on only one specific form of learning: reinforcement learning applied to neural networks. Specifically, I will discuss a very general reinforcement learning algorithm that is powerful yet simple to implement. It is also fast, and arguably close to optimal: the multiplicative weights update algorithm (MWUA; Arora et al., 2012; see also Chastain et al., 2014). This is not a new method: It has been used independently in numerous fields, such as machine learning, decision theory, optimization, and evolutionary game theory. It may very well be an algorithmic description of what is going on in our brains as we learn.

In what follows, I describe the Expert Advice Problem: how to (repeatedly) pick out one from a set of n experts and follow that expert's advice. We can imagine that this expert is giving advice on stock picks, so we are interested in finding the strategy that will maximize our payoff. Given the particular environment (inputs that the expert senses), each expert makes a particular prediction based on whatever information or intuition the expert has. This prediction could be good one day and terrible the next. Every expert is given a weight w_i, and at the outset, all of the weights are equal. An algorithm determines which of the experts' advice will be chosen to pick the stock, after which the weights are changed based on whether the investment led to a gain or a loss. This is the feedback part of the learning algorithm. We would like to know the optimal way to pick an expert based on the weights and how to best change these weights given the feedback. Here we consider only the case in which the likelihood that any particular expert will be picked for advice next time is proportional to the expert's weight (another option is the "weighted-majority" algorithm, whereby the prediction that has the highest total weight of experts advising it is chosen; Arora et al., 2012).

To make it clear from the outset how this relates to artificial brains, we can think of each of the experts as a particular firing pattern of a group of neurons in response to a sensation. Each pattern gives rise to a particular behavior, and the task is to optimize the fitness of the behaving agent given the sensation.
In a sense, we are asking, Given a particular sensed pattern (and the behavior it triggers), how should we change the likelihood of each of the possible response patterns given the feedback we receive for the ensuing behavior?

For the stock-picking experts, we are looking for the optimal way to adjust the weights given the outcome of the previous pick (a gain or a loss). It turns out that the optimal algorithm is fairly simple (Arora et al., 2012; Chastain et al., 2014). Suppose that at time point t, we pick expert i with a probability proportional to its weight,

p_i(t) = \frac{w_i(t)}{\sum_{i=1}^{n} w_i(t)},    (10)

and we record a gain g_i. If the gain is positive, then we increase the weight of this expert (in a multiplicative manner) by an amount proportional to the gain. If g_i is negative, we decrease the weight:

w_i(t+1) = w_i(t)\,\bigl(1 + \epsilon\, g_i(t)\bigr),    (11)

where ε is a small constant. We see that if the feedback is positive, then at the next time step this expert will have a higher likelihood of getting picked, while all other experts will be picked with a smaller probability. Translated to brains, we can say that those firing patterns that led to a positive outcome will be reinforced, that is, the likelihood of this pattern to fire given the same sensory input is increased, while other patterns are suppressed. This is reminiscent of the Hebbian learning rule (Hebb, 1949) that reinforces the synaptic connection between neurons if they fired together (and a positive outcome was achieved), only in the MWUA the reinforcement occurs not on the connection between two neurons but rather on the likelihood of a particular firing pattern of a set of (possibly) many neurons together.

The rate at which the optimal behavior emerges when accurate feedback is given (the learning rate) depends on the constant ε. Although a large ε tends to change weights (and therefore firing probabilities) faster, it also makes the algorithm less than optimal. We predict that the expected gain ⟨m(t)⟩ = \sum_{i=1}^{n} p_i(t) g_i increases quickly given the feedback that discourages calling on patterns that score poorly. Indeed, it is possible to show that the cumulative expected gain ⟨G⟩, given by the expected gain accumulated over the entire lifetime T, is just a little smaller than the optimal return (Arora et al., 2012):

\langle G \rangle = \sum_{t=1}^{T}\sum_{i=1}^{n} p_i(t)\, g_i(t) \geq (1-\epsilon)\, G_{\rm opt} - \frac{\ln n}{\epsilon},    (12)

where n is the number of experts (or the number of possible firing patterns in a particular group of neurons) and G_opt is the optimal gain,

G_{\rm opt} = \max\{G_1, G_2, \cdots, G_n\},    (13)

which is obtained by choosing the expert who (in hindsight) would have accumulated the largest return. Choosing a large ε might make the system learn faster, but the lower bound on the expected return decreases quickly too. Often, the choice ε ≈ ln(n)/T is a good trade-off (Chastain et al., 2014).

This learning algorithm is easy to implement on Markov brains, as the p_i(t) are simply given by the entries in the transition table for the "Markov gates" that we encountered earlier. For example, suppose that the transition table shown in Figure 3b, which, as we recall, implements the simplest algorithm for the recognition of Beethoven's "fate" motif, has instead probabilities th
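For readers who want to experiment with the update rule in Equations 10 and 11, here is a minimal sketch of the multiplicative weights update algorithm (my own illustration, not code from the article); the number of experts, the horizon, and the randomly generated gains are arbitrary choices made for the example:

```python
import math
import random

def mwua(gains, epsilon=None, seed=0):
    """Minimal multiplicative weights update (Equations 10 and 11).

    gains[t][i] is the gain of expert i at time step t, assumed to lie in [-1, 1].
    Returns the gain accumulated by the algorithm and the final weights.
    """
    rng = random.Random(seed)
    T, n = len(gains), len(gains[0])
    if epsilon is None:
        epsilon = math.log(n) / T      # the trade-off suggested in the text
    w = [1.0] * n                      # at the outset, all weights are equal
    total = 0.0
    for t in range(T):
        # Pick an expert with probability proportional to its weight (Equation 10).
        i = rng.choices(range(n), weights=w)[0]
        total += gains[t][i]
        # Multiplicative update of every expert's weight (Equation 11).
        for j in range(n):
            w[j] *= 1.0 + epsilon * gains[t][j]
    return total, w

# Toy data: five experts over 2000 rounds; expert 2 is slightly better on average.
rng = random.Random(42)
gains = [[rng.uniform(-0.5, 0.5) + (0.3 if i == 2 else 0.0) for i in range(5)]
         for _ in range(2000)]

total, w = mwua(gains, epsilon=0.05)   # a larger epsilon so this short toy run learns fast
g_opt = max(sum(g[i] for g in gains) for i in range(5))  # best expert in hindsight (Eq. 13)
picks = [wi / sum(w) for wi in w]      # final pick probabilities
print(f"algorithm gain: {total:.1f}, best single expert: {g_opt:.1f}")
print("final pick probabilities:", [round(p, 3) for p in picks])
```

With a larger update constant the weights concentrate quickly on the best expert, whereas smaller values of ε change the weights more conservatively; this is exactly the speed-versus-optimality trade-off discussed above.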
