Statistical Approaches to Automatic Text Summarization
2004; Association for Information Science and Technology; Volume: 30; Issue: 4; Language: English
10.1002/bult.319
ISSN 2163-4289
Author(s): Victoria McCargar; Topic(s): Advanced Text Analysis Techniques
An IBM software expert observed not long ago that we are producing more data every three years than humankind has in its entire history. Now, data doesn't necessarily correspond to information, let alone knowledge, but what is certain is that every day, information consumers are bombarded with more facts from more sources than they are capable of taking in.

Information can be made more digestible in a number of ways. It can be compressed into a briefer format to enable the user to absorb the information quickly. It can merely point the user to a fuller account if he or she is interested. It can tip the user to whether a piece of information will be worthwhile. Or, it can pull similar or related sources together into a single summation. Whether the approach is informative, indicative, critical or aggregative, respectively, some reductive process has to happen. To have human beings doing the reducing is time-consuming and expensive, so after many dormant years the more than 40-year-old quest to summarize text automatically has been getting a lot of renewed attention.

There are two fundamental approaches to automatic text summarization that represent the endpoints of a continuum. Both result in the compression of text, but one is relatively shallow, while the other is deep and complex. While this analysis will focus primarily on the former, it is worthwhile to understand the issues of both. On the least-complex end is summarization through text extraction, the creation of summaries using terms, phrases and sentences pulled directly from the source text using statistical analysis at a surface level. Occurrences of words or sentences are counted and analyzed according to their frequency and where they appear and reappear in the source text. This is sometimes referred to as "knowledge-poor" processing and is rooted in the term-weighting algorithms of information retrieval (IR).

The other, more complex and "knowledge-rich" endpoint is summarization through abstracting. The aim here is to turn a computer-generated analysis and synthesis of the source material into a completely new, shorter text that is still cohesive and intelligible, one that reads, in other words, as though a human had written it, and at the same time fulfills the specific information need of the user. This process is sometimes known as machine understanding, a multidisciplinary endeavor involving information retrieval, linguistics and artificial intelligence. An information-seeker's query is posed against an array of tools based on domain knowledge, comprising sets of predetermined and preprogrammed information about an area of inquiry. These include, but are not limited to, ontologies, vocabularies, thesauri and a thorough understanding of how text is characteristically structured within that domain. In simplest terms, automatic abstracting is fact extraction, compared to the mere text extraction performed by statistical methods. The difficulties of fact extraction will become clearer as we investigate examples of automatic summarization.

In "Automatic Summarizing: Factors and Directions," a 1999 position paper on the state of the art, Karen Sparck Jones drew a critical distinction between text extraction and fact extraction. In the former, she wrote, "what you see is what you get," because part or parts of the source text are extracted verbatim with no prior determination of what might actually be important. It is thus an "open" approach that determines "importance" mechanically and, she notes, not especially well.
The extracted text may well be incoherent if pronouns, synonyms or other ambiguous terms aren't sufficiently resolved. Fact extraction, on the other hand, is "closed" insofar as the process requires pre-established, domain-determined parameters for machine processing. A significant amount of preprocessing of both source material and potential queries must occur to point to the kind of information that will be sought — for example, patterns of names, locations and activities that suggest terrorism. Sparck Jones points out, "What you see is what you know," because the system assembles facts based on these requirements and may altogether ignore what the authors of the source material felt was important. The up-front development of ontologies, controlled vocabularies, thesauri and other agents of understanding is what makes this automated abstracting so challenging and what will keep researchers engaged for many years to come.

But the need to deal with information overload is immediate, which is why the text extraction model is being so actively pursued. For one thing, because it does not require creation of an entirely new text, expectations for it are quite a bit lower. Without the need for thesauri, an understanding of text structure or narrative form (known as discourse analysis), linguistic research and complex ontologies, text extraction is a less expensive pursuit. Moreover, the IR methods behind text extraction lend themselves better to collections of diverse, loosely structured documents, like big news databases or the World Wide Web.

The first serious look at automatic summarization using statistics was undertaken in 1958 by H. P. Luhn, who used an IBM 704 data processor to analyze word occurrences in a scientific article. His algorithm removed stop words, clustered similar words and counted word frequencies. From this he derived "significant" words and their distance from other significant words, and used the results to mark certain complete sentences for extraction into a printed abstract. Assigning certain words a weight based on how frequently they show up in a text is a key component of statistical text extraction, and subsequent research built on that concept.

In the late 1960s H. P. Edmundson developed the use of text features to assign weights and to extract summary information. Not only did he look for statistically significant words (called keywords), he noted their location in a text, taking advantage of certain text structures common to the documents in his source material — chemistry papers of various lengths. Words appearing in the title or a subhead or located in the first two paragraphs were good pointers to useful sentences, as were what he called cue words — phrases like "in conclusion" or "in this paper," which tended to be parts of summary statements. He also took into account bonus words like "significant" or "noteworthy" and stigma words like "impossible" or "unimportant." Edmundson also made the effort to improve his algorithm by comparing his extracted summaries with human-created abstracts and using the results to refine the algorithm. These tuning constants were an early form of system "learning," which more recent researchers have tried to integrate into extraction systems. They have included training algorithms that learn from pairs — tuples — of documents and their manually produced abstracts how to classify sentences for inclusion or exclusion from a summary.
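To make those mechanics concrete, here is a minimal Python sketch of the kind of frequency-and-feature scoring that Luhn and Edmundson pioneered. The stop, cue, bonus and stigma word lists and the four weights are illustrative placeholders, not values either researcher actually used.

```python
import re
from collections import Counter

# Illustrative word lists; Luhn and Edmundson relied on much larger, hand-built ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}
CUE_PHRASES = {"in conclusion", "in this paper"}
BONUS_WORDS = {"significant", "noteworthy"}
STIGMA_WORDS = {"impossible", "unimportant"}

def score_sentences(title, sentences, top_k=3,
                    w_freq=1.0, w_title=1.0, w_pos=0.5, w_cue=1.0):
    """Score each sentence with a Luhn-style frequency term plus
    Edmundson-style title, location and cue-word terms."""
    tokenize = lambda text: re.findall(r"[a-z']+", text.lower())
    # Luhn: term frequencies over the whole text, stop words removed.
    freq = Counter(w for s in sentences for w in tokenize(s)
                   if w not in STOP_WORDS)
    title_words = set(tokenize(title)) - STOP_WORDS

    scored = []
    for i, sent in enumerate(sentences):
        words = [w for w in tokenize(sent) if w not in STOP_WORDS]
        if not words:
            continue
        # Frequency component: average frequency of the sentence's content words.
        f = sum(freq[w] for w in words) / len(words)
        # Title overlap and location (early sentences get a small boost).
        t = len(title_words & set(words))
        p = 1.0 if i < 2 else 0.0
        # Cue phrases, bonus words and stigma words.
        c = sum(phrase in sent.lower() for phrase in CUE_PHRASES)
        c += sum(w in BONUS_WORDS for w in words)
        c -= sum(w in STIGMA_WORDS for w in words)
        scored.append((w_freq * f + w_title * t + w_pos * p + w_cue * c, i))

    # Keep the top-scoring sentences, returned in their original order.
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:top_k])
    return [sentences[i] for i in keep]
```

Edmundson's real contribution was tuning weights like these against human-written abstracts; the defaults above are arbitrary.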
Text summarization research slowed considerably in the late 1970s and 1980s, as researchers moved on to more readily solvable problems; for example, that period saw quite a bit of investigation into the field of automatic indexing. But the advent of the Web, and the relative ease with which IR methods can be implemented, gave those methods new cachet in the 1990s. In addition to term-proximity research building on Edmundson, other currently promising avenues include statistical analysis of term clustering, statistically based analysis of text structure, or discourse analysis, and training algorithms that use human-generated abstracts to determine probabilities that certain source-text sentences would appear in an automated summary. Each of these approaches represents a point on the continuum drawing nearer to full-text understanding. The following examples of recent initiatives illustrate some of the key approaches to statistical processing.

Automatic indexing research by Gerard Salton of Cornell University and others in the 1970s and 1980s evolved into statistical processing methods based on tf.idf weighting, which assigns significance to a term by counting the number of times it appears in a document (tf, the term frequency) and multiplying the result by the term's inverse document frequency (idf) — the logarithm of the total number of documents in the collection divided by the number of documents containing the target term. In the 1990s, Salton used tf.idf weights and other measures derived from indexing research to identify closely related segments within a document and then to compare those relationships with those of other documents, generating automatic hyperlinks when the similarities were close. Building on that, the analysis of similarities among various combinations of paragraphs within a document can reveal relatedness — where a topic is reinforced elsewhere in the document — or unrelatedness, suggesting the beginning of a new topic or angle. Added together, these intra-document results are able to suggest an overall text structure without the need for the complex linguistic theory often associated with discourse analysis. Even more interesting, these internal links can be compared to a query and an extracted summary constructed at retrieval time: summarization on the fly, tailored to the user's particular information need. In a 1997 study, Salton and colleagues evaluated automatically constructed extracts against human-constructed abstracts and found that, measured by the amount of overlap each had with the source document, the two were nearly identical. However, it should be noted that they used IR graduate students, not professional abstract writers, to produce the summaries.

In the area of machine training, Julian Kupiec and others (1995) employed an analysis technique that allows for a recalculation of probabilities as "learning" progresses — known as Bayesian statistics. The probabilities in question were the likelihood that a given sentence in a source text should be included in a summary, based on the frequency of text features. The analysis also provided various categories of matches between the source text and summary, including a direct match, where summary sentence and source sentence are identical, and direct join, where two source sentences are combined into a single summary sentence.
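Those match categories supply the training labels; the combination itself is simple naive Bayes arithmetic. The following Python fragment is a minimal sketch assuming binary features and the independence assumption Kupiec and his colleagues made; the training format and feature handling are illustrative stand-ins, not their actual implementation.

```python
from collections import defaultdict

def train(sentence_records):
    """Estimate the quantities needed for the naive Bayes combination.

    Each record is (features, in_summary): features maps feature names to
    booleans, and in_summary says whether a human abstract contained a
    version of the sentence (a direct match or part of a direct join).
    Assumes at least one positive example in the training data."""
    n = len(sentence_records)
    n_in = sum(1 for _, in_summary in sentence_records if in_summary)
    p_summary = n_in / n                      # P(s in S)
    p_f = defaultdict(float)                  # P(F_j = 1)
    p_f_given_s = defaultdict(float)          # P(F_j = 1 | s in S)
    for features, in_summary in sentence_records:
        for name, present in features.items():
            if present:
                p_f[name] += 1.0 / n
                if in_summary:
                    p_f_given_s[name] += 1.0 / n_in
    return p_summary, p_f, p_f_given_s

def score(features, p_summary, p_f, p_f_given_s):
    """Ranking estimate of P(s in S | F1..Fk) under independence:
    P(s in S) * product of P(Fj | s in S) / P(Fj) over the features present."""
    prob = p_summary
    for name, present in features.items():
        if present and p_f[name] > 0:
            prob *= p_f_given_s[name] / p_f[name]
    return prob
```

At summarization time every sentence of a new document is scored this way and the highest-scoring fraction is extracted.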
In tests of the Bayesian algorithm, 84 percent of the machine summaries overlapped with sentences in the manual summaries at a 25 percent compression of the source text, which was double the overlap that Edmundson cited at the same rate of compression. The optimal set of features for Kupiec turned out to be a combination of location, cue phrase and sentence length. Follow-on experiments with Korean texts indicated that the Bayesian approach was language-independent.

Eduard Hovy and Chin-Yew Lin (1999) worked on discovering the best locations for picking out abstract-worthy sentences, using an existing concept thesaurus to provide rudimentary interpretation of sentences selected through a topic-identification routine. They used tf.idf weightings and other tools to develop topic-rich keywords from a training collection of 13,000 newspaper articles covering technology industry announcements, and from those keywords they developed ranked lists of sentences that contained topical terms. Unlike other document types, news stories have a fairly predictable structure, with the important information typically at the beginning of the article, but this can vary according to editing practices at different publications. In the collection of technology stories, the title (headline) was the optimal place for locating usable terms, followed by the first sentence of the second paragraph. A second test of 30,000 general-interest Wall Street Journal articles revealed that the title was optimal, followed by the first paragraph. In the case of the technology stories, journalists tended to "tease" a new product announcement in the initial sentences with abstract language, reserving the facts for the second paragraph. In the Wall Street Journal, different editing standards resulted in the salient facts being included in the first paragraph.

For all the promise of these methods, the problems associated with automatic summarization based on text extraction are still significant, 30 years after Edmundson noted them. The following are among the problems.

Extracted sentences often need at least some human editing to smooth the language. Salton did research with statistically generated "transitions" — useful phrases or sentences from the source text that act to smooth a shift in topics in a summary — but results were spotty. Better, more satisfactory approaches to smoothing and revising the text require some understanding of discourse structure and domain knowledge.

The twin problems of unresolved anaphors (such as pronouns, which refer back to words earlier in the text) and cataphors (ambiguous words signaling a term that shows up later in the text) are especially thorny. A common solution appears to be simply to delete the dangling references from the summary, or, failing that, to pick up the preceding or subsequent sentence from the source text and hope that the anaphor or cataphor is resolved. A more reliable solution would require linguistic analysis well beyond the scope of a pure IR approach.

Resolving rhetorical devices in text is a similar problem. Cue phrases like "on the other hand" or "on the contrary" may creep into a summary and set up an opposing statement that does not exist in the source text. Again, inserting the preceding or subsequent sentence from the source text into the summary may resolve the reference, but it is an uncertain solution at best.
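A crude version of that repair heuristic is easy to express in Python. The pronoun and cue-phrase lists below, and the choice to prepend the preceding source sentence rather than drop the extract, are simplifying assumptions made for illustration, not a method prescribed by any of the researchers cited here.

```python
import re

# Openers that often dangle when a sentence is pulled out of context;
# an illustrative list, not an exhaustive one.
DANGLING_PRONOUNS = {"he", "she", "it", "they", "this", "these", "those"}
DANGLING_PHRASES = {"on the other hand", "on the contrary", "however"}

def repair_extract(source_sentences, selected_indices):
    """Prepend the preceding source sentence whenever an extracted sentence
    opens with a likely dangling reference, in the hope that the antecedent
    (or the other half of the rhetorical contrast) lives one sentence back."""
    keep = set(selected_indices)
    for i in selected_indices:
        tokens = re.findall(r"[a-z']+", source_sentences[i].lower())
        opening = " ".join(tokens[:4])
        dangles = (tokens and tokens[0] in DANGLING_PRONOUNS) or \
                  any(opening.startswith(p) for p in DANGLING_PHRASES)
        if dangles and i > 0:
            keep.add(i - 1)
    return [source_sentences[i] for i in sorted(keep)]
```

Dropping the offending sentence entirely is the other common fallback; either way the fix is mechanical rather than linguistic.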
Disambiguating collection-wide term lists without a degree of domain knowledge, including thesauri and established vocabularies, is a difficult proposition. Synonyms like truck-lorry or elevator-lift require thesauri and query expansion, whereas the polysemy problem — multiple meanings for words like "lead" or "book" — requires that the system understand the context of the term, which may follow not only from the source text but ultimately from the user's information need. Susan Feldman has pointed out that queries to the Web about war will turn up a lot of sports articles, given the writers' tendency to describe the playing field in terms of battles, routs and crushing defeats.

Content that stretches the definition of "text" is problematic for automated extraction, including tables, itemized lists, equations, diagrams, and so on. If a source title reads "The Candidate's Top Four Proposals," the extraction algorithm would have to be intelligent enough to pick out all four from the following text or fail as a summarization tool.

It is often difficult to find a suitable body of texts with which to develop learning algorithms and conduct trials. Collections of documents with human-produced abstracts are important for evaluating the machine-generated versions but aren't always available. Even when they are, researchers must try to control for variations in quality; author-supplied abstracts, for example, are frequently less systematic or structured than those written by professional abstractors. (Abstracts developed by the researchers' own graduate students, with their insiders' knowledge, add another interesting variable.) Regardless of the source, proper evaluation demands some kind of "ideal" abstract with which to make comparison — but does ideal mean merely that it overlaps the source text to a satisfactory level, or does it mean that it actually meets a user's information need? If the latter, did the user need a quick summary, a pointer to additional resources, a critique of a resource or what, exactly?

Solving these problems requires some level of linguistic analysis and domain knowledge, and thus they mark the threshold where information retrieval methods and statistical processing alone begin to falter. The breakpoint between an extract and an abstract is how successfully an idea has been conveyed in a summary: The summary is not created from mere words and phrases picked out of the source text, but from a generalization of those terms into coherently compressed text. The "knowledge-poor" approach, then, is stuck with the mechanical components of text, and it takes ontologies, thesauri and a deep understanding of text structure to advance into semantic, syntactic or discourse-level understanding. While it is an improvement over the "bag of words" of mere indexing, a program to produce a statistically generated summary, as Eduard Hovy pointed out, will scarcely understand "that the sequence enter + order + wait + eat + pay + leave can be summarized as restaurant visit."

For the near term, though, most summarization techniques will rely on IR and statistical processing, and there is plenty of groundwork still to be done. Text extraction is the focus of a number of interesting research areas, including the following:

Summarizing multiple documents. This research scales the single-document techniques described above to collections of any size: identifying what multiple documents may have in common, where they differ and which documents are unique. Features like location, time frame, proper names and other more domain-specific data can be used in a template structure to summarize, for example, terrorist events.
Exploiting structure in HTML. The minimal markup provided by HTML tagging — heads, subheads, and so on — is statistically analyzed to gather topic threads into a concise summary of both individual documents and larger bodies of text, including the Web.

Language translation. Terms from a collection comprising multiple languages are selected for translation based on how frequently they occur and analyzed in aggregate to suggest a list of topic terms. While lacking complete sentences and narrative form, the terms might permit at least a basic understanding of the source text or point to a need for a more complete translation of certain documents. Additional statistical analysis suggests a coherent ordering of words and phrases, rather than an unstructured, alphabetical jumble.

Exploiting rhetoric. Statistics are used to analyze statements contained in scientific papers that refer to a document's new contribution to the field and its relationship to previous research. The extracted segments have a good probability of pointing out key, summary-worthy sentences in the source text.

Multimedia applications. These apply text analysis techniques to audio transcripts and closed-caption feeds, but also take into account indirect cues such as silences, change of speaker, handoffs from reporter to anchor and logo recognition. As voice-recognition technology improves, this approach can be combined with text analysis; for example, key frames of significant images can be extracted from video and combined with a textual summary.

In spite of the numerous drawbacks described here, statistically based text extraction is an important strategy in the creation of automated summaries. Research is directed at improving relevance and coherence, as well as gaining a better understanding of discourse structure. The direction of research is increasingly at the interface between statistical analysis and linguistic analysis, both still fairly young technologies. Collection-based training and link analysis of documents are both statistical methods of arriving at a reasonable model of discourse structure — a basic requirement for turning an extract into an abstract. Meanwhile, a growing body of domain knowledge from many disciplines is gradually entering the mix. Still, there are many questions to answer before the relative ease and flexibility of text extraction combines with the informational power of fact extraction to produce summaries that meet the needs of any user.

Hovy and Lin gave voice to the excitement in this newly rediscovered field. "It is so difficult that interesting headway can be made for many years to come," they wrote in 1999. "We are still excited about the possibilities offered by the combination of semantic and statistical techniques in what is surely one of the most complex tasks in all of natural language processing." And while researchers continue to sort out the complexities of automated abstracting, text extraction still holds a lot of promise. Karen Sparck Jones pointed out that the ability simply to scan a less-than-coherent summary may be enough for certain patient searchers, those "tolerant but rational users with loosely defined tasks" — especially those with information overload.

Victoria McCargar is Senior Editor, Library Projects, in the Los Angeles Times Editorial Library, 202 West First Street, Los Angeles, CA, 90012; 213-237-7129; e-mail: vicky.mccargar@latimes.com