The third AI summer: AAAI Robert S. Engelmore Memorial Lecture
2022; Association for the Advancement of Artificial Intelligence; Volume 43, Issue 1; Language: English
DOI: 10.1002/aaai.12036
ISSN 2371-9621
Abstract: This article summarizes the author's Robert S. Engelmore Memorial Lecture, presented at the Thirty-Fourth AAAI Conference on Artificial Intelligence on February 10, 2020. It explores recurring themes in the history of AI, real and imagined dangers from AI, and the future of the field. We are now in AI's third summer, a period of rapid scientific advances, broad commercialization, and exuberance—perhaps irrational exuberance—about our potential to unlock the secrets of general intelligence. Twice before the field of AI has experienced such a period, and each was followed by a winter of collapse of commercialization and drastic cuts in government investments in research. In this essay, I will argue that despite this cyclical history, enduring insights have blossomed each summer. The winters can be viewed as times of contemplation and integration that advance the field through the synthesis of new and old ideas. I will also argue that we may be at the end of the cyclical pattern; although progress and exuberance will likely slow, there are both scientific and practical reasons to think a third winter is unlikely to occur.

In every summer, articles and books about AI written for nonexperts have found wide audiences. I read four recent books shortly before writing this essay: The Master Algorithm, by Domingos (2015); AI Superpowers, by Lee (2018); Human Compatible, by Russell (2019); and Rebooting AI, by Marcus and Davis (2019). The first is an objective history of machine learning and, like this essay, emphasizes the continuous evolution of the field. The second charts the dramatic rise in AI R&D in China and points the way to a utopian future. The third argues that superhuman artificial intelligence will be an existential risk if the values of such AIs are not aligned by design with those of humans. The fourth contends that deep learning, the most powerful approach to machine learning devised to date, will soon reach inherent limits, and that a different approach that synthesizes recent and older approaches to AI will be necessary in the future. This essay will touch on many of the same elements as these four books. I will first provide a history of AI; next, discuss near-term dangers of AI; and finally, describe a number of different technical approaches for future AI.

If one were to create a cartoon history of AI, the first panel would show the symbolic approach to AI—pictured, say, as Tom the Cat—beating up on the artificial neural network approach, which we can picture as Jerry the Mouse. The second panel shows both Tom and Jerry shivering in a wintry scene; and the third shows Jerry, now grown huge and powerful through deep learning, easily dispatching Tom (Figure 1). There is more than a grain of truth in this cartoonish view of the history of AI from the 1980s through the present day. The story it presents is incomplete, however, both in chronology and in failing to illustrate the rich set of ideas and approaches that developed and entwined through the history of the field.

William Grey Walter was a polymath in neuroscience and electronics. As a young man in the 1930s, he built the first electroencephalography (EEG) machine in the United Kingdom and discovered that the measurement of brain waves could be used to locate brain tumors responsible for epilepsy (Walter 1953).
Thirty years later, a groundbreaking paper he coauthored in Nature showed that spikes in neural activity could be used to predict motor events a full half-second before the subject was consciously aware of having made the decision to move—in other words, that the conscious mind only thought it was making decisions (Walter et al. 1964). Walter was as much an engineer and tinkerer as a scientist. During World War II, he designed radar systems.

The mechanistic view he took of the brain led him to experiment with artificial neural networks—not just as a mathematical abstraction, as in the work of McCulloch and Pitts (1943), but as the decision-making engine for an embodied artificial animal (Figure 2). Beginning in 1948, he built and demonstrated a series of increasingly sophisticated autonomous tortoise-shaped three-wheeled robots (Hoggett 2011). Their analog electronic brains employed up to seven vacuum tubes, which interpreted signals from touch, light, and sound sensors and controlled propulsion and steering motors. Although their behavior was hard-wired, the later versions supported a form of conditioned-reflex learning. A capacitor-based memory could learn to associate the simultaneous activation of two sensors—for example, the sound of a whistle and the obstacle-detecting bump sensor. The reflex triggered by the bump sensor—backing up and turning—could then be triggered by the sound sensor alone.

The tortoises' legacy includes the field of artificial neural networks, which today dominates research and development in artificial intelligence. A few of the well-known major steps in the development of artificial neural networks were the error-based perceptron learning rule of Rosenblatt (1958), the development of backpropagation for training multilayer networks (Werbos 1974; Rumelhart, Hinton, and Williams 1986), and parameter sharing in structured networks, in particular convolutional networks (Fukushima 1980; LeCun et al. 1989).

It is easy to see the aspects of artificial neural networks that Walter got wrong: most obviously, the use of analog electronics and the focus on stimulus–response learning rather than error-minimization learning. It is just possible, however, that Walter was simply wildly premature. Artificial neural networks are now being compiled into edge-computing hardware for applications such as video surveillance; while such hardware is now digital, there is research on creating analog artificial neural networks that could operate with a fraction of the energy needed by digital circuits. Furthermore, one could argue that the tortoises' implementation of stimulus–response learning was an early attempt at unsupervised learning—which is today the most important and challenging problem in research on machine learning.

Artificial neural networks were not the only legacy of the tortoises. They demonstrated that complex purposeful behavior arises in the interaction between an agent and an environment, an idea that stands in sharp contrast to the more cerebral symbolic approaches to AI that we will describe shortly. Walter was part of a larger movement that aimed to understand animal and machine intelligence using feedback loops and other tools of control theory. The field was given the name "cybernetics" with the publication of Norbert Wiener's book of that name (Wiener 1948). The tortoises were a perfect example of a mechanism regulated by feedback from their environment.
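As a concrete point of reference for the error-driven learning that displaced Walter's stimulus–response scheme, the following minimal sketch implements a Rosenblatt-style perceptron update on an invented toy dataset. It is an illustration of the general rule only, not code associated with the lecture or with any of the cited systems.

```python
def perceptron_train(examples, lr=0.1, epochs=100):
    """Rosenblatt-style error-driven updates on (x, y) pairs, with y in {0, 1}."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            error = y - y_hat  # zero when the prediction is already correct
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Toy example: learn logical OR, which is linearly separable.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = perceptron_train(data)
print(weights, bias)
```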
Cybernetics flourished in the former Soviet Union but never gained a foothold in the US AI research community until a synthesis of control theory and dynamic programming (Bellman 1957) emerged under the banner of reinforcement learning (Witten 1977; Sutton and Barto 1981). Even then, researchers in reinforcement learning were a small minority in the general AI community for decades. Researchers made steady progress in developing mathematical frameworks for training control systems when the feedback signal was distant in the future. The fact that rewards can be temporally distant from the agent's actions distinguishes reinforcement learning from stimulus–response learning; indeed, the ability to act for delayed gratification is a key aspect of intelligence. Temporal-difference learning (Sutton 1988) provided a general approach to learning from delayed rewards, and proved to be particularly effective when the agent's internal state was represented by a neural network. We shall see the potent combination of artificial neural networks and reinforcement learning reemerge in the third AI summer.

The first AI summer also saw the birth of a very different approach to building intelligent machines, an approach whose heritage stretches back thousands of years. This is the logic-based approach to AI, or, more generally and accurately, the approach based on declarative knowledge representation. Symbolic logic grew out of the art of rhetoric in ancient Greece. Around 350 BC, Aristotle formalized certain kinds of deductive arguments symbolically in his Prior Analytics. His key insight—indeed, the insight that is the basis not only for logic but for the theory of computing—is that reasoning can be performed by considering only the syntactic form of statements, without considering their meaning. After this prescient beginning, however, over 2000 years passed before significant advances were made in formal logic. George Boole created a complete characterization of propositional logic (Boole 1854), as Gottlob Frege, Charles Sanders Peirce, David Hilbert, and others did for quantified logics in the following decades. This generation of philosophers, however, had a primary motivation for their work that differed from that of Aristotle and his medieval followers: their ultimate goal was to provide a complete and rigorous basis for mathematics rather than to understand everyday reasoning and argumentation. They therefore poured enormous energy into trying to overcome the paradoxes of naive set theory (Russell 1903) and were devastated by the discovery that no logic could capture all mathematical truths (Gödel 1931).

The concerns of the researchers who pioneered the logical approach to AI stood in sharp contrast to those of the philosophers of mathematics. First, the AI researchers were encouraged by the creation of programs that could automatically find proofs of some—not necessarily all—mathematical theorems, and were untroubled by logic's inherent incompleteness. The celebrated Logic Theorist program (Newell and Simon 1956) was able to prove 38 elementary theorems from Principia Mathematica (Whitehead and Russell 1910–1913). Second, most AI researchers had little interest in mathematics as the subject matter of logic. Instead of trying to axiomatize abstruse mathematics, John McCarthy argued, researchers should strive to develop logical representations of commonsense knowledge (McCarthy 1958).
McCarthy's original paper described knowledge about locations (e.g., one can be at a desk, in a car, and so on) and physical movement (e.g., one might walk from one location to another nearby location), and his former student Patrick Hayes called for the axiomatization of commonsense physics (Hayes 1978). Others attempted to represent the logical rules of human discourse (Allen et al. 1977), thus closing the loop with the ancient Greeks' view of logic as a tool for analyzing rhetoric.

Researchers in the first AI summer also began work on systems that employed graphs rather than the strings or trees of classical logic to represent knowledge. These new kinds of representations were called "semantic networks" and used vertices to represent concepts and edges to represent relationships. The word "semantic" came from their initial use as an interlingua for translating between different natural languages (Richens 1956); they were intended to capture the meaning, or semantics, of sentences. Although their inventors were presumably unaware of it at the time, one researcher has argued that semantic networks were a rediscovery of the diagrams that ancient Sanskrit scholars used to analyze texts (Briggs 1985). Researchers increasingly converged on the view that semantic networks were simply an alternative notation for classical logic, as exemplified by Ronald Brachman's work on KL-ONE (Brachman and Schmolze 1985). Just as all practical programming languages are Turing-complete and thus theoretically equivalent but differ in ease or naturalness of use, these researchers argued that semantic networks were simply a more natural form of first-order logic, one whose syntax explicitly described concepts in terms of their attributes and of how they generalized or specialized other concepts. This version of semantic networks became known as "description logic." Pure description logic, however, proved inadequate for representing large real-world domains because it could capture only the absolutely necessary properties of concepts, not those that were prototypical or that held by default.

In recent years, companies including Google, Facebook, Microsoft Bing, eBay, and IBM have developed enormous networks called "knowledge graphs," which they use to drive many applications, such as web search and product recommendation (Singhal 2012). Despite their scale and ubiquity of use, many aspects of knowledge graphs remain informal; for example, in addition to the issue of whether links represent absolute or prototypical relations, the distance between concepts in a knowledge graph is often used as a heuristic measure of concept similarity. Later in this essay, we will describe a different family of graph-based knowledge representation formalisms, called "graphical models," that combines logic, graph theory, and probability theory.

The first AI summer's third research campaign was the quest for efficient algorithms for combinatorial search. We now understand that, in terms of formal computational complexity theory, this quest is an impossible dream: the general task of reasoning in any suitably expressive formal system is NP-complete or harder (Cook 1971) and thus, it is believed, requires worst-case exponential time. Even STRIPS-style planning—that is, finding sequences of actions that are defined in terms of preconditions and effects—is NP-complete for the simple "blocks world" domain (Gupta and Nau 1991).
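To make the STRIPS-style formulation just mentioned concrete, here is a minimal, hypothetical encoding of a single blocks-world action as precondition, add, and delete sets. The predicate names and the Python representation are invented for illustration; they are not drawn from the original STRIPS system or the cited complexity analysis.

```python
# A STRIPS-style action: applicable when its preconditions all hold,
# and it transforms the state by deleting and adding ground facts.
def make_pickup(x):
    return {
        "name": f"pickup({x})",
        "pre":  {f"clear({x})", f"ontable({x})", "handempty"},
        "add":  {f"holding({x})"},
        "del":  {f"clear({x})", f"ontable({x})", "handempty"},
    }

def apply_action(state, action):
    if not action["pre"] <= state:  # every precondition must be in the state
        raise ValueError(f'{action["name"]} is not applicable')
    return (state - action["del"]) | action["add"]

state = {"ontable(A)", "clear(A)", "ontable(B)", "clear(B)", "handempty"}
state = apply_action(state, make_pickup("A"))
print(sorted(state))  # ['clear(B)', 'holding(A)', 'ontable(B)']
```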
However, the fact that such complexity results had not yet been discovered may have helped lead the early AI researchers to an important insight: an enormous space of possibilities can be searched in ways that are more efficient than simple enumeration. This insight differentiated AI researchers from philosophers and mathematicians, for whom the existence or nonexistence of an algorithm that would terminate after an exhaustive enumeration of possibilities was the end of the discussion: this problem was decidable, or that problem was undecidable. It was obvious to AI researchers that human reasoning was not a simple enumeration, but involved shortcuts that made the task feasible given the time and computational resource limitations of the brain. Herbert Simon, J. C. Shaw, and Allen Newell discovered and implemented one such search algorithm, means-ends analysis, in their General Problem Solver (1959), and Newell and Simon later argued, in their monumental treatise Human Problem Solving (1972), that humans employ it as well as a variety of other reasoning strategies.

How can non-enumerative search be practical when the underlying problem is exponentially hard? The approach advocated by Simon and Newell is to employ heuristics: fast algorithms that may fail on some inputs or output suboptimal solutions. For example, the means-ends planning heuristic chooses an action that will reduce the difference between the initial state and the goal state, applies the action to the initial state, and recursively applies the process to the new state and the goal state (Figure 3). Although the heuristic is intuitively appealing, it is not difficult to find problems where it fails, becoming stuck in a cycle in which it reduces one difference but introduces another. The A* algorithm (Hart, Nilsson, and Raphael 1968) provided a general framework for complete and optimal heuristically guided search. A* is used as a subroutine within practically every AI algorithm today but is still no magic bullet; its guarantees of completeness and optimality are bought at the cost of worst-case exponential running time.

An interesting class of incomplete heuristic search algorithms comprises those based on a "noisy" version of iterative repair, a heuristic similar to means-ends analysis. Iterative repair begins by guessing a solution to the problem. It then iteratively identifies a flaw in the solution and patches it, yielding a new proposed solution. As with means-ends analysis, simple iterative repair can easily become stuck in a cycle. A noisy version of iterative repair reduces the likelihood of becoming stuck by periodically making random changes to the solution; even if most of the random changes are bad, eventually a change is likely to be introduced that lets the search break out of the cycle. A version of noisy iterative repair named "simulated annealing" was invented by physicists and has proven widely applicable to optimization problems (Kirkpatrick, Gelatt, and Vecchi 1983). Bart Selman and I showed that a simple version of iterative repair called "local search with noise" was even more effective for finding satisfying assignments to logical formulas; the reason for the improvement was that the random steps were restricted to ones that made the proposed solution satisfy at least one previously unsatisfied problem constraint, even if they did the reverse for some of the other constraints (Selman, Levesque, and Mitchell 1992; Selman, Kautz, and Cohen 1996).
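The following sketch conveys the flavor of local search with noise for satisfiability as described above: it repeatedly picks an unsatisfied clause and flips the value of one of its variables, sometimes greedily and sometimes at random, so that every flip satisfies at least one previously unsatisfied clause. It is a simplified toy, with an invented clause encoding and arbitrary parameter choices, not the procedures from the cited papers.

```python
import random

def noisy_local_search(clauses, n_vars, max_flips=10000, noise=0.5):
    """Clauses are lists of nonzero ints; literal v means variable |v| must be True if v > 0."""
    assign = {v: random.choice([True, False]) for v in range(1, n_vars + 1)}
    sat = lambda c: any((lit > 0) == assign[abs(lit)] for lit in c)
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not sat(c)]
        if not unsatisfied:
            return assign                    # all constraints repaired
        clause = random.choice(unsatisfied)  # pick a violated constraint to fix
        if random.random() < noise:
            var = abs(random.choice(clause))  # noisy step: random variable in the clause
        else:
            # greedy step: flip the variable that leaves the fewest clauses unsatisfied
            def cost(v):
                assign[v] = not assign[v]
                bad = sum(not sat(c) for c in clauses)
                assign[v] = not assign[v]
                return bad
            var = min((abs(lit) for lit in clause), key=cost)
        assign[var] = not assign[var]
    return None  # gave up; the formula may still be satisfiable

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(noisy_local_search([[1, 2], [-1, 3], [-2, -3]], 3))
```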
Another way to solve an NP-hard problem in practice is to employ an algorithm whose empirical complexity on a problem distribution of interest grows subexponentially, or exponentially but with a very small exponent. For example, the best complete algorithm for satisfiability testing is backtracking search over the space of partial variable assignments, the Davis–Putnam–Logemann–Loveland (DPLL) algorithm (Davis et al. 1961), augmented by a technique called "clause learning" (Marques-Silva and Sakallah 1996; Bayardo and Schrag 1997). When the backtracking algorithm reaches a dead end—that is, when it determines that the current partial assignment is inconsistent—the clause-learning module computes a minimal subset of the previous assignment choices that led to the inconsistency and adds the negation of that combination to the problem as a new clausal constraint. The new clause prevents those choices from being made together in any other branch of the search tree, thus pruning the search. Although the result is still an exponential algorithm in the worst case, my colleague Paul Beame, my student Ashish Sabharwal, and I showed that DPLL with clause learning is provably more powerful than DPLL alone (Beame, Kautz, and Sabharwal 2004). The algorithm's running time grows remarkably slowly in many real-world problem domains; for example, Pushak and Hoos (2020) argued that its empirical scaling on bounded model-checking problems is subexponential.

During the first AI summer, many people thought that machine intelligence could be achieved in just a few years. The Defense Advanced Research Projects Agency (DARPA) launched programs to support AI research with the goal of using AI to solve problems of national security; in particular, to automate the translation of Russian to English for intelligence operations and to create autonomous tanks for the battlefield. Researchers had begun to realize that achieving AI was going to be much harder than had been supposed a decade earlier, but a combination of hubris and disingenuousness led many university and think-tank researchers to accept funding with promises of deliverables that they should have known they could not fulfill. By the mid-1960s neither useful natural-language translation systems nor autonomous tanks had been created, and a dramatic backlash set in. New DARPA leadership canceled existing AI funding programs. In 1969, the powerful Senate Majority Leader Mike Mansfield hobbled AI research funding by all military agencies for decades by pushing through a law that prohibited the military from funding fundamental research not directly related to specific military functions.

Outside of the United States, the most fertile ground for AI research was the United Kingdom. The AI winter in the United Kingdom was spurred on not so much by disappointed military leaders as by rival academics who viewed AI researchers as charlatans and a drain on research funding. A professor of applied mathematics, Sir James Lighthill, was commissioned by Parliament to evaluate the state of AI research in the nation. The report stated that all of the problems being worked on in AI would be better handled by researchers from other disciplines, such as applied mathematics (Lighthill 1973). The report also claimed that AI successes on toy problems could never scale to real-world applications because of combinatorial explosion. This claim, of course, ignored the quest in AI for methods to tame combinatorial search described above. In response to the report, all public funding of AI research in the United Kingdom was terminated.
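Returning for a moment to clause learning: the toy sketch below shows only the bookkeeping step, under the strong simplifying assumption that the solver has already identified the subset of decisions responsible for a conflict. Real solvers derive that subset from the implication structure of the search; the encoding here is invented purely for illustration.

```python
# Toy illustration of the clause-learning step: when a set of decisions
# (variable -> chosen truth value) leads to a contradiction, add a clause
# forbidding that exact combination so no other branch repeats it.
def learn_clause(conflicting_decisions):
    """conflicting_decisions: dict such as {1: True, 4: False, 7: True}."""
    # The learned clause is the negation of the conjunction of the decisions:
    # not (x1 and not x4 and x7)  ==  (not x1 or x4 or not x7)
    return [-var if value else var for var, value in conflicting_decisions.items()]

print(learn_clause({1: True, 4: False, 7: True}))  # [-1, 4, -7]
```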
The second AI summer was marked by the field's change in focus from commonsense knowledge to expert knowledge. Expert systems, it was believed, would be able to substitute for trained professionals in medicine, finance, engineering, and many other fields. An expert—say, a doctor—would be debriefed by a knowledge engineer, who would encode the expert's vast experience into a large set of rules and facts. A general symbolic reasoning system could then apply these rules to solve particular problems—for example, to create a diagnosis on the basis of a patient's symptoms. The rules could also drive the system to gather further information—for example, to order certain blood tests for the patient in order to refine the diagnosis.

The beginning of the second summer can be dated, in regular-expression notation, as 19(6|7)8: that is, it could be said to have started in 1968 or in 1978. In 1968, Feigenbaum, Lederberg, and Buchanan (1968) created the first expert system, Dendral. It was intended to help organic chemists identify unknown organic molecules by analyzing their mass spectra and applying knowledge of chemistry. Dendral attracted much academic interest and led to the development of expert systems in other domains, notably MYCIN (Shortliffe and Buchanan 1975) for diagnosing bacterial infections and INTERNIST-I, which aimed to capture the internal-medicine expertise of the chair of the department of internal medicine at the University of Pittsburgh (Pople 1976). It was not until 1978, however, that expert systems became a hot area of R&D, with the creation and commercial deployment of XCON (McDermott 1980).

In the 1970s, buying a computer system was a slow and error-prone process. Computers were much less standardized than they are today, and a buyer needed to choose among hundreds of options when placing an order for one. Options could interact in complex ways: some combinations of options could not physically be built, or, if built, would perform poorly; some options required particular choices among other options; and so on. The process of ordering a VAX computer from Digital Equipment Corporation (DEC) could require as long as 90 days of back-and-forth between a customer, sales representatives, and DEC engineers to create a correct system configuration. XCON reduced the time to generate a satisfactory system configuration for a customer to about 90 minutes. The enormous advantage this gave DEC in the marketplace did not go unnoticed. Soon, companies of all sorts began developing and deploying expert systems for a variety of tasks in engineering and sales. Feigenbaum's phrase "knowledge is power" became the slogan of the era.

The second AI summer differed from the first in that it was driven as much by commercial money as by government support. In addition to investments by companies using expert systems, venture capital flowed into companies creating a software and hardware ecosystem to support expert systems. Software startups sold expert system "shells," that is, reasoning engines with user interfaces intended to make it possible for nonprogrammers to enter rules. The fact that expert system development was incremental meant that dynamically linked programming languages were preferred, which in the 1970s and 1980s meant varieties of LISP or Prolog.
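As a schematic illustration of how a rule-based reasoning engine of the kind described above applies expert rules to facts, the following minimal forward-chaining sketch uses invented, medical-flavored rule names. Actual systems such as MYCIN and XCON used far richer rule languages, uncertainty handling, and control strategies.

```python
# Minimal forward chaining over if-then rules: repeatedly fire any rule whose
# conditions are all among the known facts, adding its conclusion, until no
# rule adds anything new. The rule contents are invented for illustration.
RULES = [
    ({"fever", "cough"}, "possible_respiratory_infection"),
    ({"possible_respiratory_infection", "positive_culture"}, "bacterial_infection"),
    ({"bacterial_infection"}, "recommend_antibiotics"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"fever", "cough", "positive_culture"}, RULES))
```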
The relatively slow performance of LISP and Prolog, given the implementations and hardware of the era, motivated the construction of computer hardware that directly interpreted LISP (by the startup companies Symbolics and LMI) or Prolog (by various Japanese companies under the auspices of Japan's Fifth Generation project).

Many reasons can be offered for the arrival of the second AI winter. The hardware companies failed when much more cost-effective general-purpose Unix workstations from Sun, together with good compilers for LISP and Prolog, came onto the market. Many commercial deployments of expert systems were discontinued when they proved too costly to maintain. Medical expert systems never caught on for several reasons: the difficulty of keeping them up to date; the challenge for medical professionals of learning to use a bewildering variety of different expert systems for different medical conditions; and, perhaps most crucially, the reluctance of doctors to trust a computer-made diagnosis over their gut instinct, even in specific domains where the expert systems could outperform an average doctor. Venture capital deserted AI practically overnight. The world AI conference IJCAI hosted an enormous and lavish trade show and thousands of nonacademic attendees in 1987 in Vancouver; the main AI conference the following year, AAAI 1988 in St. Paul, was a small and strictly academic affair.

Commercial factors aside, enthusiasm for expert systems cooled because of two central technical challenges; indeed, overcoming these challenges set the work plan for the next two decades of research in AI. The first challenge was the need for principled and practical methods for probabilistic reasoning. The logical rule-based approach excelled at capturing knowledge about relationships among concepts and entities (such as class/subclass/instance or object/part/attribute hierarchies) but was poorly suited to problems where one needed to assign probabilities to conclusions. Although the need to handle uncertainty was recognized by early expert system researchers, they did not yet know of probabilistically sound methods of reasoning that were computationally practical; systems such as MYCIN and its descendants instead attached "certainty factor" numbers to rules and facts and combined them in an ad hoc manner.

The second unsolved challenge for the expert system approach was named the "knowledge acquisition bottleneck." Capturing all but the narrowest domains required a huge number of rules. Not only was it difficult or impossible to recruit and train enough experts to write enough rules, but once the knowledge bases became large they inevitably became full of inconsistencies and errors.

The field of AI did not disappear during the slightly more than two decades of the second AI winter. It continued steadily as a relatively small but intellectually vigorous research field, freed of the hype and demands for commercial profit. The challenge of sound but efficient probabilistic reasoning was first met by what were called graphical probabilistic models. Bayesian networks (Pearl 1988) provided a solution to the problem of compactly representing multivariable probability distributions without requiring exponentially large probability tables. Each conditional probability statement was represented by a set of directed edges ending at a node, together with a conditional probability table for the variable associated with that node.
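For concreteness, the following minimal sketch shows how a three-variable network, with one conditional probability table per node, defines a full joint distribution by multiplying the per-node entries along the edges. The structure and numbers are invented for illustration and are not taken from Pearl's work.

```python
# Tiny Bayesian network: Cloudy -> Rain -> WetGrass (a made-up example).
# Each node stores P(node | parents) as a small table indexed by parent values,
# so the full joint distribution never has to be written out explicitly.
P_cloudy = {True: 0.5, False: 0.5}
P_rain_given_cloudy = {True: {True: 0.8, False: 0.2},
                       False: {True: 0.1, False: 0.9}}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    """P(cloudy, rain, wet) via the chain-rule factorization along the edges."""
    return (P_cloudy[cloudy]
            * P_rain_given_cloudy[cloudy][rain]
            * P_wet_given_rain[rain][wet])

# Marginal P(wet grass) by summing the factored joint over the other variables.
p_wet = sum(joint(c, r, True) for c in (True, False) for r in (True, False))
print(round(p_wet, 3))  # 0.515 for these made-up numbers
```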
The graph has a much stronger meaning, however, than the conditional probability statements alone: it represents a single probability distribution—the so-called maximum-entropy distribution—rather than the set of all distributions that are consistent with the original conditional probability statements. In many problems, one does indeed want to reason with the maximum-entropy distribution, because it is the one under which our given knowledge captures all interesting relationships between the variables. The introduction of Bayesian networks led to fruitful decades of research on extensions to Bayesian networks, on alternative graphical models, and on a variety of new algorithms for probabilistic reasoning. Heckerman and Shortliffe (1992) discovered the conditions under which MYCIN's certainty factors could be given a probabilistic interpretation, thus explaining why expert systems sometimes gave sensible answers but at other times did not. Around the turn of the century, the field of statistical–relational reasoning arose; it sought to develop representations and algorithms that combined the semantics of graphical models with the expressive power of the finite fragment of first-order logic (Friedman et al. 1999; Richardson and Domingos 2006).

Overcoming the knowledge acquisition bottleneck led the field of AI to a renewed focus on machine learning. For most of the second winter, however, few researchers returned to the roots of machine learning in artificial neural networks. Methods were developed for learning decision trees (Quinlan 1986) and logical rules (Muggleton and Feng 1990). The parameters (conditional probability tables) of graphical models could be estimated directly from complete data, or estimated by the expectation–maximization algorithm when the data were incomplete (Dempster, Laird, and Rubin 1977). Valiant's (1984) work on probably approximately correct (PAC) learning showed the limits of learnability for any method relative to the amount of available data. Until the revival of artificial neural networks in the third summer, the most powerful approach to "black box" machine learning (that is, learning that did not rely upon or attempt to create an interpretable domain model) was the support vector machine (SVM) pioneered by Cortes and Vapnik (1995). An unintuitive feature of SVMs was that they often worked well when highly over-parameterized, a situation that had been thought to be necessarily associated with overfitting. Deep learning with artificial neural networks turned out to share this surprising feature. Even as AI research methodology became steadily more rigorous …