Article · Open access · Peer-reviewed

Influences and Inferences

2013; Association for Computational Linguistics; Volume: 39; Issue: 4; Language: English

DOI

10.1162/coli_a_00171

ISSN

1530-9312

Authors

Jerry R. Hobbs

Topic(s)

Lexicography and Language Studies

Abstract

I am deeply honored to receive the ACL's Lifetime Achievement Award. I'm especially honored when I look back at the list of previous winners—Chuck Fillmore, Eugene Charniak, Eva Hajičová, Fred Jelinek, Martin Kay, Aravind Joshi, and the others—they're all my heroes.

I was of course delighted to learn of this award. The most we can hope for in life is to take part in the conversation, and an award like this means that you've taken part in the conversation.

It seems to be a tradition to begin with a few formative anecdotes from childhood. For me, it all begins before I was born. My grandfather, a crusty old country lawyer in southern Indiana, told my father not to bother trying to go to law school. "You don't know English grammar," he said. "You'll flunk out." My dad accepted the challenge, bought a book entitled English Grammar, by Smith, Magee, and Seward (1928), and mastered it. He went on to become a very successful lawyer.

Fast-forward to when I was in junior high school. My dad was distressed that my English classes looked to him more like social studies, and barely touched on grammar. So he persuaded me—actually, he probably bribed me, but I can't remember what with—to master that same book, English Grammar by Smith, Magee, and Seward. This was a concession, because I was a math nerd, reading only textbooks on trigonometry and calculus, as my way of avoiding the humiliation of playing baseball. But I read the book, and I was amazed. English grammar was just like math! It had the same sorts of rules, the same kinds of abstractions, the same types of puzzles. It was actually fun!

In my junior or senior year of high school we had to take something called the Kuder Preference Test, which would help us decide what career to choose. I scored high in math and in language. So my high school counselor told me I should write math books. In fact, she got it exactly backwards. It wasn't that I should do language about math. It was that I should do math about language.

I've met any number of computational linguists with a similar story. They grew up not knowing whether they wanted to be a physicist or a poet. They just knew both sounded fascinating. Then they discovered our field.

My last near miss happened the week I was drafted into the Army. They gave us a battery of aptitude tests to see what specialties we'd be best for. One of the tests was to see if we should be sent to the Monterey Language School. Looking back on it, I realize now it was testing how well you could understand formal language theory. They'd give you a bunch of rules for an artificial language, and you'd have to say whether different strings were in or not in the language. I'd never seen anything like it, but it was really fun to do. Later I met with a personnel specialist who went over my test scores. I got a 46 out of 50. He ignored that until I pointed it out to him. Then he said, "That's a mistake. Nobody ever gets more than 6 or 7 points on that test."

I said, "No, I think it might be correct."

He said, "It doesn't matter. You're not going to the Monterey Language School. You're going to South Vietnam."

Actually, I didn't go to the Monterey Language School or to South Vietnam. I spent two years in South Carolina, and was glad to be there. How I managed that is a story for another occasion.

So I didn't really discover computational linguistics until my third year in graduate school at New York University.
In October I passed my oral exam in topics like algebraic topology and complex analysis, by one generous yes and two abstentions. In the subsequent months I discovered more and more facts about myself—for example, that I was never going to figure out a faster way of multiplying matrices, and that fascinating though recursion theory might be, I was never going to prove a theorem that Hartley Rogers would be compelled to include in his next edition. As I surveyed vaguely plausible fields, I realized I had no idea what the next problem to solve would be or even what makes a problem interesting.

Then in April, when I had nearly resigned myself to becoming a taxi driver, I discovered New York University's best-kept secret: Naomi Sager's Linguistic String Project. I think it is computational linguistics' best-kept secret as well. She was motivated by the science, not by the performance, and her very impressive work is nowhere near as well known as it should be. I think her Linguistic String Grammar (Sager 1981) ranks, as a computational specification of English syntax, with Pollard and Sag's Head-driven Phrase Structure Grammar (1994), for thoroughness, insight, and elegance. So, for example, in 1992 when we developed the FASTUS system for information extraction using cascaded finite-state transducers (Hobbs et al. 1997), it was straightforward to copy the rules for Noun Groups straight from her grammar. It's no accident that in the late 1980s during the Strategic Computing Initiative and in the early 1990s in the Message Understanding Conferences, three of the most important efforts were led by Linguistic String Project alumni—Ralph Grishman's group at New York University, Lynette Hirschman's at Unisys, and my group at SRI International. I think the most important lesson I learned from Naomi Sager was to look closely at the data and to take it seriously.

My other thesis advisor was Jack Schwartz. He was a polymath, so to speak. I took a course in logic from him. I knew about his book on compilers and the classic Dunford and Schwartz on functional analysis. But when I saw his book on mathematical economics and his book on the theory of relativity, I did some research to see if there was more than one Jack Schwartz. Among his writings was an unpublished Chapter 9 of his compilers book, on parsing natural language, which I of course read.

My thesis was on Earley's algorithm applied to natural language. It quickly became apparent that the constraints on phrase structure rules had to be expressed and that one could do that with fairly simple operations on vectors of features, where among the features were what I called the "cores" of the constituents, since they bundled many of the relevant features. My "core" was what linguists came to call "head." Years later, I ran across Chapter 9 again and reread it, and realized that all the ideas in my thesis were there. So when in 1987 Schwartz told someone that I had anticipated head-driven phrase structure grammar, that was his way of saying he had anticipated head-driven phrase structure grammar.

My first job was at Yale University as a very temporary instructor—I think the position is now called "post-doc." Over the course of the year I became convinced that syntax was a solved problem—something I still believe. But that left me adrift for problems to work on. I became discouraged, and found myself thinking again about driving that taxi.
Then late one afternoon, just as I was about to go home, a graduate student named Fred Howard came into my office to ask a couple of questions. That triggered a discussion that lasted until 11 o'clock that evening. One of the wheels we reinvented was a recognition of the pervasiveness of spatial metaphor in discourse. (This was before Lakoff and Johnson (1980), but after similar observations by the 18th-century Italian philosopher Giambattista Vico (1968 [1744]) and the 20th-century English literary critic I. A. Richards (1936).) But within a year, everything else of value that remained of the content of that discussion could be compressed into a long footnote in a technical report. In any case, this conversation lit a fire that fueled my research for the next 15 or 20 years.

In particular, I began looking at texts, trying to understand how we understand them. No doubt influenced by Chuck Rieger's thesis (Rieger 1974), I asked what inferences we draw in the course of comprehension, and, an issue Rieger did not address, what inferences we do not draw. This culminated in 1976 in an unreadable (and unread) technical report (Hobbs 1976), microanalyzing one paragraph from Newsweek, trying to specify every bit of knowledge required for understanding the text and describing how every linguistic problem in the text invokes that knowledge to arrive at solutions. One could say that the rest of my career has been a matter of cleaning up and extending that technical report, in terms of representation, the process of inference and interpretation, and the specification of common-sense knowledge.

In 1977 I moved to SRI, where I fell under the influence of Nils Nilsson and Bob Moore, and of John McCarthy at nearby Stanford. They were campaigning to replace the ad hoc styles of representation of early AI with representations based on first-order logic. But the problem in a nutshell is this: When we are trying to represent an English sentence like Pat believes Chris is tall, we really want to write

(1) believe(Pat, tall(Chris))

The difficulty is that tall is a predicate and tall(Chris) evaluates to true or false, so we are left with Pat believes a truth value, with not a hint of Chris's tallness to be found. A common solution to this is to treat believe not as a predicate but as an opaque operator that blocks evaluation of its operands.

Many special logics have been developed for such operators. For example, knowing about modal and temporal logics, Russell's iota operator, functionals, lambda expressions, and so on, we might represent the sentence

(2) Maybe the boy wanted to build a boat quickly.

by the expression (3). This bothered me because it seemed like we were introducing a new operator with its own special logic every time we encountered a new word to define or characterize. For 20,000 words would we have to introduce 20,000 new operators?

It seemed to me that we should rather stay within first-order logic, abiding by two principles:

1. All morphemes are created equal.
2. Every morpheme conveys a predication.

We could achieve this kind of representation by means of reification. Thus, if tall'(e, Chris) says that eventuality e is the state or eventuality of Chris being tall, then we can represent Pat believes Chris is tall by

(4) believe(Pat, e) ∧ tall'(e, Chris)

Sentence (2) is then represented as

(5) maybe'(e0, e1) ∧ want'(e1, x, e2) ∧ the(x, e3) ∧ boy'(e3, x) ∧ build'(e2, x, y) ∧ a(y, e4) ∧ boat'(e4, y) ∧ quick'(e5, e2)

There's nothing exotic here (other than reification).
It's all first-order logic, predicates applied to arguments where the arguments are existentially quantified variables with widest possible scope, ranging over a universe of possible individuals.

The extremes to which we go in identifying morphemes with predications can be seen in the predication the(x, e3). What could that possibly mean? Well, ask what information is being conveyed by the word the. It is a relation between an entity x and a description e3, and it says the entity is uniquely mutually identifiable in context by means of the description. We can give this relation a name. We could call it something like uniquely-mutually-identifiable-in-context. But why not keep it simple, and name the predicate after the morpheme that conveys it – the?

Knowledge representation schemes that use extensive reification are often called "Davidsonian," after the philosopher Donald Davidson (1967), who proposed reifying events. But he balked at reifying states, let alone negations of states and events. He would not have treated Chris's tallness as a thing. By contrast, I adopted a position that, because I was young and wild, I called "ontological promiscuity." Now that I'm older and more domesticated, I would probably call it something like "ontological prosperity" or "ontological comfortable circumstances" or maybe "ontological glut."

Many balk at such abandonment of ontological scruples. No doubt I was influenced by the near solipsism that infected many researchers in the early days of AI. Our brains could be fooling us, just as we often fool computers to test our programs. Yes, there is probably a world out there that occasionally bites back. But the world is benevolent—after all, we evolved in it. When we breathe, there is almost always oxygen there. That's no accident. So it doesn't matter very much what we believe. We can believe all sorts of crazy things and be completely ignorant of apparently real and pervasive phenomena. Until the recent past we believed in the spirits of the dead, and we were entirely ignorant of 98% of the electromagnetic spectrum. If you are willing to admit the existence of physical objects, sets, numbers, and possible worlds, what ontological scruples do you have anyway? So why should we give any credence at all to our intuitions about what exists and what doesn't? Why not simply stipulate that everything that can be talked about exists in a Platonic universe of possible individuals, since that makes it so much easier to represent and reason about the content of natural language discourse?

The result of this move and similar reifications to eliminate quantifier scopings is that the logical form of a sentence is a flat conjunction of existentially quantified propositions, with one predication per morpheme.

But there is a problem. The sentence

(6) John is tall.

would be represented

(7) John'(e1, x) ∧ tall'(e3, x)

whereas the sentence

(8) John is not tall.

would be represented

(9) John'(e1, x) ∧ not'(e2, e3) ∧ tall'(e3, x)

But P ∧ Q ∧ R implies P ∧ R, so it would seem that John is not tall implies John is tall.

The wrinkle is that tall'(e3, x) does not say that x is tall. It says that e3 is a possible eventuality of x's being tall. The eventuality e3 may or may not exist in the real world, and if it does, that is one of its properties – Rexist(e3). This means that we have to distinguish between the content of a sentence and its claim. Sentences (6) and (8) have highly overlapping content. But the claim of sentence (6) is e3, the tall-ness, while the claim of sentence (8) is e2, the negation of the tall-ness.
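To make the flat notation concrete, here is a rough sketch of how such logical forms might be held as data; the Python classes and variable names are purely illustrative, not any particular system's representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predication:
    """One morpheme's contribution: a predicate over an eventuality and arguments."""
    predicate: str
    args: tuple

@dataclass
class LogicalForm:
    """A flat conjunction of predications plus the eventuality claimed
    to really exist (Rexist)."""
    predications: frozenset
    claim: str

# (7)  John is tall.
john_is_tall = LogicalForm(
    predications=frozenset({
        Predication("John'", ("e1", "x")),
        Predication("tall'", ("e3", "x")),
    }),
    claim="e3",   # the tall-ness is what is claimed
)

# (9)  John is not tall.
john_is_not_tall = LogicalForm(
    predications=frozenset({
        Predication("John'", ("e1", "x")),
        Predication("not'", ("e2", "e3")),
        Predication("tall'", ("e3", "x")),
    }),
    claim="e2",   # the negation is what is claimed
)

# The two sentences share most of their content; only the claim differs,
# so no unwanted entailment from (8) to (6) arises.
print(john_is_tall.predications <= john_is_not_tall.predications)   # True
print(john_is_tall.claim, john_is_not_tall.claim)                   # e3 e2
```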
The general procedure for deciding whether an eventuality really exists is as follows:

Step 1: Identify the claim.
Step 2: Propagate truth and falsity through implicatives.
Step 3: As a courtesy to the speaker, assume the other propositions are true. (But note that in modal contexts there is an ambiguity in whether the grammatically subordinated material holds in the real world [de re] or in the modal context [de dicto].)

For example, in

(10) The lazy man did not manage to avoid attending the meeting.

Step 1 says the claim is the "not." Step 2 says that therefore "manage" is false, "avoid" is false, and "attend" is true. Step 3 says that "lazy," "man," and "meeting" are all true.

This kind of representation has the advantage of yielding a very elegant view of compositional semantics. In traditional approaches to compositional semantics, the meanings of constituents are lambda expressions, and composition happens by function application. With a flat logical form, the only role function application plays is identifying variables with each other. This gives us a two-part account of compositional semantics:

1. The lexicon provides predicate–argument relations.
2. Syntax identifies variables.

For the sentence

(11) The man attended the meeting.

ignoring the and tense, we get from the individual words the propositions

(12) man'(e1, x1), attend'(e2, x2, y2), meeting'(e3, y3)

When we recognize that attended the meeting is a verb phrase, this amounts to recognizing that y2 = y3. When we recognize the man attended the meeting as a clause, we have recognized that x1 = x2.
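A minimal sketch of this two-part picture, with the lexicon supplying predications over fresh variables and syntactic composition doing nothing but equating variables; the helper function and names are invented for illustration.

```python
# Toy sketch of compositional semantics as variable identification:
# the lexicon supplies predications with fresh variables, and syntactic
# composition only equates variables.  Illustrative only.

def equate(predications, var_a, var_b):
    """Identify two variables: rewrite every occurrence of var_b as var_a."""
    return [(pred, tuple(var_a if v == var_b else v for v in args))
            for pred, args in predications]

# (12)  Predications contributed by the individual words of
#       "The man attended the meeting" (ignoring 'the' and tense).
lf = [("man'",     ("e1", "x1")),
      ("attend'",  ("e2", "x2", "y2")),
      ("meeting'", ("e3", "y3"))]

# Recognizing "attended the meeting" as a verb phrase: y2 = y3.
lf = equate(lf, "y2", "y3")

# Recognizing "the man attended the meeting" as a clause: x1 = x2.
lf = equate(lf, "x1", "x2")

for predication in lf:
    print(predication)
# ("man'", ('e1', 'x1'))  ("attend'", ('e2', 'x1', 'y2'))  ("meeting'", ('e3', 'y2'))
```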
In 1979 and 1980, I had the huge good fortune to participate in a biweekly discussion group on discourse, alternating between Stanford and Berkeley, consisting of some of the most illustrious scholars of language in the world, including Mike Agar, Dwight Bolinger, Eve and Herb Clark, Chuck Fillmore, Paul Kay, George Lakoff, Geoff Nunberg, Ivan Sag, Dan Slobin, Elizabeth Traugott, and Tom Wasow. For me personally, the high point in these meetings, and one of the high points in my entire career, was when the sociologist Erving Goffman, visiting Berkeley at the time, used my paper "Conversation as Planned Behavior" (Hobbs and Evans 1980) as a club to beat the sociolinguist John Gumperz over the head with. Metaphorically speaking. We read and discussed members' papers on interpreting nominal compounds, metonymy or deferred reference, de-nominalized nouns, metaphor, and other phenomena that came to be clustered by linguists under the name of "Radical Pragmatics" (Cole 1981). (I thought a better name would be "Run-of-the-mill AI".)

Around this time, I was concerned with the problem of how we delimit the set of inferences we draw as we understand a text. The answer that seemed most promising was that we need to draw those inferences required to resolve interpretation problems of the sort we were examining in the discussion group. But what systematicity was there to this set of problems? How would you know if your list was complete?

The scheme that made the most sense to me goes like this. A text conveys predications, that is, a predicate applied to one or more arguments – p(x). This gives rise to three sorts of problems:

1. What is the predicate? What is p? This question subsumes the problems of lexical ambiguity, the interpretation of vague predicates like prepositions and have, and the interpretation of the implicit relation in nominal compounds.

2. What is the argument? What is x? This question subsumes the problems of coreference and syntactic ambiguity. (Recall that syntactic structure is a matter of identifying variables in the right way.)

3. In what way are the predicate and argument congruent? What about p and x would allow p to be true of x? This question subsumes the problems of metaphor and metonymy.

This collection of problems I called "local pragmatics." They are problems that are presented within the scope of single sentences, but they often require for their solution the entire discourse, the external context, and world knowledge. (My term never caught on, probably because no one else saw this class of problems as a natural kind.)

Another issue I was thinking about during these years was the structure of discourse, in particular, that structure arising out of coherence relations between discourse segments. In this I was very much influenced by the work of the linguists Joseph Grimes (1975) and Robert Longacre (1976). I began collaborating with the anthropologist Mike Agar around this time, and we called this level of structure "local coherence" (Agar and Hobbs 1982).

In the mid-1970s Ray Perrault and Phil Cohen (Cohen and Perrault 1979) at the University of Toronto, later to be my colleagues at SRI, and Chip Bruce (Bruce and Newman 1978) at BBN were doing very exciting work analyzing the structure of discourse as arising out of the speaker's or writer's plan, employing formalizations of planning from artificial intelligence. In work with David Evans and work with Mike Agar I tried to apply these insights to the complexities of ordinary conversation and to ethnographic interviews. Agar and I called this level of structure "global coherence."

All along in investigating all three of these problems—local pragmatics, local coherence, and global coherence—it was clear that a key role was played by the notions of implicature (Grice 1975), accommodation (Lewis 1979; Thomason 1985), and abduction (Peirce 1955). To solve even elementary problems like pronoun coreference, one had to make assumptions to get a good interpretation of the text, where the only justification for the assumptions was that they led to a good interpretation.

In the fall of 1987 at SRI we organized a discussion group on abduction, reading the classic papers by Peirce, recent attempts in AI to use abduction in, for example, medical diagnosis (Pople 1973; Cox and Pietrzykowski 1986), and contemporary philosophers like Paul Thagard (1978), as well as work by Wilensky and Norvig at Berkeley (Wilensky 1983; Norvig 1987) and Charniak and Goldman at Brown (Charniak and Goldman 1988) that seemed to be taking an approach similar to ours. Among the people in our group were Mark Stickel, Doug Edwards, and the pragmatics scholar Steve Levinson, who was visiting Stanford at the time. We argued about what we were calling identity implicatures and referential implicatures, and about how to distinguish new from given information in discourse, and how to choose the best interpretation of a text.

Then late one afternoon in October 1987 Mark Stickel came into my office to say that he thought he had the answer to all our problems. He described his algorithm for weighted abduction.
It struck me immediately as the double helix of computational linguistics, a feeling that has not entirely abandoned me today. First of all, it gave us a characterization of what constituted the interpretation of a stream of discourse. It gave us a clear criterion for what inferences to draw and not draw. The interpretation was the most economical explanation for what would make the text true, and an inference was appropriate if and only if it contributed to that explanation.

On my way home that night, I began driving a little more carefully. In the next few days, I saw how one would approach all the local pragmatics and local and global coherence problems in this framework. In discussions with Stu Shieber in the next few days it became apparent how one could integrate syntax smoothly into the framework. A big picture emerged (Hobbs et al. 1993).

In the early 1990s I saw an advertisement in a magazine for Polaroid cameras (quite obsolete now). It showed a man standing by the ocean, holding a camera, and looking at a scene in which the branch of a tree is on the ground and a small boat is stuck in the top of another tree. When we see this, we immediately interpret it by coming up with the best explanation for the observables (abduction). There was a storm that blew the branch down and blew the boat into the tree. There are other possible explanations. Maybe someone chopped the branch down, and maybe the boat was lifted into the tree with a crane. But this is not as good an interpretation because we have to assume two things (the chopping and the crane) rather than just one (the storm). The first interpretation is better because it is more economical. Less explains more.

But this isn't the end of the story. There is another observable to be explained. Why is this picture in the magazine? The explanation is that it is an advertisement. That means there was an ad agency involved in posing the picture, and they very well could have done the chopping and used the crane, rather than wait for the rare event of a storm to arrange the picture for them.

We could call the first explanation the "informational" one. It explains the content of the picture, thereby explicating the information conveyed by the picture. We could call the second explanation the "intentional" one. It explains why the message occurs at all. Note that both interpretations need to be discovered if the advertisement is to be fully appreciated.
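The economy criterion can be sketched in miniature as follows; this is only a toy assumption-counting search, not Stickel's weighted-abduction algorithm, and the propositions and rules are invented for the example.

```python
from itertools import chain, combinations

# Toy version of "less explains more": explain the observables by choosing
# assumptions, and prefer the explanation that assumes the least.

observables = {"branch_on_ground", "boat_in_tree"}

# Each candidate assumption, with the observables it would account for.
accounts_for = {
    "storm":             {"branch_on_ground", "boat_in_tree"},
    "someone_chopped":   {"branch_on_ground"},
    "crane_lifted_boat": {"boat_in_tree"},
}

def explanations():
    """Yield every set of assumptions that accounts for all the observables."""
    candidates = list(accounts_for)
    subsets = chain.from_iterable(
        combinations(candidates, k) for k in range(1, len(candidates) + 1))
    for subset in subsets:
        covered = set().union(*(accounts_for[a] for a in subset))
        if observables <= covered:
            yield set(subset)

# The most economical explanation makes the fewest assumptions.
print(min(explanations(), key=len))   # {'storm'}
```

In a real weighted-abduction system the assumptions carry numeric costs rather than a simple count, but the preference for the cheapest covering explanation is the same idea.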
The big picture that emerges is this (see Figure 1). The brain is an abduction machine, continuously trying to prove abductively (i.e., by making necessary assumptions) that the observables in its environment constitute a coherent situation. (We can encompass action as well as perception by adding to what is proved the proposition that the owner of the brain will thrive in that situation.)

Sometimes among the observables is another agent's utterance. What is to be explained is the proposition utter(i, u, w)—that is, a speaker i utters to a hearer u a string of words w. Generally the best explanation for an utterance is that it is an intentional act aimed at conveying information. We can capture this with the axiom

(13) Segment(w, e) ∧ goal(i, c) ∧ cog'(c, u, e) ⊃ utter(i, u, w)

That is, if w is an interpretable segment of discourse describing a situation e, and a speaker i has the goal c that a hearer u adopt some cognitive stance toward e, then (defeasibly) i will utter to u the string of words w.

The first conjunct in the antecedent is the entry point into the informational side of an interpretation: What is the content of the message? The second two conjuncts are the entry point into the intentional side: Why is the speaker conveying this content?

The reason that the speaker has this particular goal is usually that it plays some role in, or is a subgoal of, a larger plan the speaker is executing in the world. This is where that reasoning occurs. It encompasses what Agar and I called "global coherence"—how does the utterance fit in with what else is going on in the world?

The next level of analysis happens when we decompose the segment of discourse into smaller segments, using the axiom

(14) Segment(w1, e1) ∧ Segment(w2, e2) ∧ rel(e, e1, e2) ⊃ Segment(w1w2, e)

This axiom says that if w1 is a segment describing situation e1, and w2 is a segment describing situation e2, and there is a relation between e1 and e2, then the concatenation is a segment describing a situation e somehow derivable from the relation. When we backchain on this axiom, we are explaining an interpretable segment of discourse by breaking it into parts, explaining the parts, and explaining the relation between them.

The possible coherence relations are just the sort of relations that frequently obtain between two states or events: causality, similarity, identity, a strong sort of temporal succession I have called "occasion," the figure–ground relation, and predicate–argument relations. These are similar to other catalogues of discourse relations that others have come up with. However, the intent is to capture the information that can be conveyed by adjacency. By contrast, the relations of Rhetorical Structure Theory (Mann and Thompson 1988) are a mixture of informational relations like similarity and intentional relations like justification. The first is what is conveyed by adjacency; the second is what the speaker is using adjacency to do. Often the coherence relation conveyed by adjacency is expressed redundantly (and with less ambiguity) in a conjunction (so), an adverb (consequently), or a referential expression (That made …). This does not pose a problem, assuming the two do not conflict; discourse is rife with redundancy.

Decomposition of a discourse in this fashion yields a tree or tree-like structure. It bottoms out in individual clauses, and this is where syntax takes over. Adjacency in larger stretches of discourse can convey a variety of possible relations. As we saw at the end of Section 2, adjacency within clauses conveys predicate–argument relations. Syntax is a set of rules that enable us to convey and interpret complex predicate–argument relations with the rather crude device of concatenation. The best explanation of a clause is the decomposition given to us by compositional semantics. The best explanation for an individual morpheme is that it is intended to convey its corresponding predication. Thus, the syntactic analysis of a clause bottoms out in its logical form.

Now all that remains to be explained is the logical form. It was the original insight of the "Interpretation as Abduction" framework that the best abductive proof (i.e., the best explanation) of the logical form solved the local pragmatics problems as a side effect. I won't make an extended argument for that here, but one example should convey the basic idea.

The sentence, due to Hirst (1987),

(15) The plane taxied to the terminal.

has three lexical ambiguities. A plane could be an airplane or a wood-smoother, a terminal could be an airport terminal or a computer terminal, and taxiing could be a plane moving on the ground or a person riding in a cab. We assume we have axioms expressing these possibilities, together with a rule that says airports have airplanes and airport terminals. Then the most economical explanation (Figure 2) is constructed by assuming there is an airport and that an airplane we expect to find there is moving on the ground to the airport terminal we expect to find there. Note that the ambiguous words are disambiguated as a by-product by virtue of the axioms that are used in the explanation. The predicate airport-terminal plays a role; the predicate computer-terminal doesn't.
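In the same spirit as the earlier sketch, here is a propositional toy of that computation; the atoms and axioms are simplified stand-ins for the ones just described, whereas a real system would use variables, weighted assumptions, and a much larger knowledge base.

```python
from itertools import chain, combinations

# Toy interpretation of "The plane taxied to the terminal" by abduction:
# find the smallest set of assumptions from which the three word-level
# predications can be derived.

observations = {"plane", "taxi", "terminal"}

rules = [                                       # antecedents -> consequent
    ({"airplane"}, "plane"),                    # a plane may be an airplane ...
    ({"wood_smoother"}, "plane"),               # ... or a wood-smoother
    ({"airplane", "move_on_ground"}, "taxi"),   # an airplane taxiing ...
    ({"person", "ride_in_cab"}, "taxi"),        # ... or a person riding in a cab
    ({"airport_terminal"}, "terminal"),         # an airport terminal ...
    ({"computer_terminal"}, "terminal"),        # ... or a computer terminal
    ({"airport"}, "airplane"),                  # airports have airplanes
    ({"airport"}, "airport_terminal"),          # and airport terminals
]

def closure(assumed):
    """Forward-chain from the assumed atoms to everything they entail."""
    known, changed = set(assumed), True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known

assumable = sorted({atom for antecedents, _ in rules for atom in antecedents})

def explanations():
    """Yield every assumption set whose consequences cover the observations."""
    subsets = chain.from_iterable(
        combinations(assumable, k) for k in range(1, len(assumable) + 1))
    for subset in subsets:
        if observations <= closure(subset):
            yield set(subset)

print(min(explanations(), key=len))   # {'airport', 'move_on_ground'}
```

Under these toy axioms the cheapest covering assumption set is the airport plus the airplane's moving on the ground, so the aviation senses of plane, taxi, and terminal come out selected with no separate disambiguation step.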
All of this raises a question. If the framework is so elegant and so all-encompassing, why isn't it more widely adopted? I think there are three reasons for this, historically.

1. Parsers were not accurate enough to produce good logical forms from which inference could start.
2. Algorithms for abduction were too inefficient.
3. There was a lack of an adequate knowledge base.

Each of these problems has been alleviated somewhat in the past few years. There are now highly accurate statistical parsers, and for several of these (e.g., Boxer; Bos 2008) a component for translating into a flat logical form has been implemented.

Recent work by Naoya Inoue and Kentaro Inui (2011) implements weighted abduction as a problem in integer linear programming, building on earlier work by Charniak and Santos (Santos 1996). Our experience with this is that when we switched from a naive backchaining implementation to the ILP implementation, we got a speed-up of two orders of magnitude.

Finally, there have been ongoing efforts to build large knowledge bases, manually and automatically, from a number of different perspectives. Efforts to use Cyc for natural language processing applications have had mixed success at best. But Schubert's efforts (2002) to build a knowledge base by analyzing language use look very promising. Some applications have attempted to use OpenMind. WordNet hierarchies are used very widely, and Harabagiu and Moldovan (2002) developed XWN, a conversion of WordNet glosses into logical axioms, and reported success with its use in question-answering. FrameNet has been converted into logical axioms by Ovchinnikova et al. (2013), and she and her colleagues have shown that an abduction engine using a knowledge base derived from these sources is competitive with the best of the statistical systems in textual entailment and se
