TREC: Improving information access through evaluation

2006; Association for Information Science and Technology; Volume: 32; Issue: 1; Language: English

10.1002/bult.2003.1720320105

ISSN

2163-4289

Authors

Ellen M. Voorhees

Topic(s)

Data Quality and Management

Abstract

“If you can not measure it, you can not improve it.” —Lord Kelvin

Evaluation is a fundamental component of the scientific method: researchers form a hypothesis, construct an experiment that tests the hypothesis and then assess the extent to which the experimental results support the hypothesis. A very common type of experiment is a comparative one in which the hypothesis asserts that Method 1 is a more effective solution than Method 2, and the experiment compares the performance of the two methods on a common set of problems. The set of sample problems, together with the evaluation measures used to assess the quality of the methods' output, forms a benchmark task.

Information retrieval researchers have used test collections, a form of benchmark task, ever since Cyril Cleverdon and his colleagues created the first test collection for the Cranfield tests in the 1960s. Many experiments followed in the subsequent two decades, and several other test collections were built. Yet by 1990 there was growing dissatisfaction with the methodology. While some research groups did use the same test collections, there was no concerted effort to work with the same data, to use the same evaluation measures or to compare results across systems to consolidate findings. The available test collections were so small – the largest of the generally available collections contained about 12,000 documents and fewer than 100 queries – that operators of commercial retrieval systems were unconvinced that the techniques developed using test collections would scale to their much larger document sets. Even some experimenters were questioning whether test collections had outlived their usefulness.

At this time, NIST was asked to build a large test collection for use in evaluating text retrieval technology developed as part of the Defense Advanced Research Projects Agency's TIPSTER project. NIST proposed that instead of simply building a single large test collection, it organize a workshop that would both build a collection and investigate the larger issues surrounding test collection use. This was the genesis of the Text REtrieval Conference (TREC). The first TREC workshop was held in November 1992, and a workshop has been held annually since then.

The cumulative effort represented by TREC is significant. Approximately 250 distinct groups representing more than 20 different countries have participated in at least one TREC, thousands of individual retrieval experiments have been performed and hundreds of papers have been published in the TREC proceedings. TREC's impact on information retrieval research has been equally significant. A variety of large test collections have been built for both traditional ad hoc retrieval and new tasks such as cross-language retrieval, speech retrieval and question answering. TREC has standardized the evaluation methodology used to assess the quality of retrieval results and, through the large repository of retrieval runs, demonstrated both the validity and efficacy of the methodology. The workshops themselves have provided a forum for researchers to meet, facilitating technology transfer and discussions on best practices. Most importantly, retrieval effectiveness has doubled since TREC began.

This article provides a brief introduction to TREC. After an initial section that describes how TREC operates, the article summarizes the impact TREC has had in the areas of retrieval system effectiveness, retrieval system evaluation and support of new retrieval tasks.
TREC is sponsored by the U.S. National Institute of Standards and Technology (NIST) with some support from the U.S. Department of Defense. Participants in TREC are retrieval research groups drawn from the academic, commercial and government sectors. TREC assumes the Cranfield paradigm of retrieval system evaluation, which is based on the abstraction of a test collection: a set of documents, a set of information needs that TREC calls topics and a set of relevance judgments that say which documents should be retrieved for which topics.

For each TREC, NIST supplies a common set of test documents and a set of 50 topic statements. The format of the topic statements has varied over the years, but generally consists of at least a brief natural language statement of the information desired (e.g., What are the economic benefits of recycling tires?). Participants use their systems to run the topics against the document collection and return to NIST a list of the top-ranked documents for each topic. Since TREC document sets contain an average of about 800,000 documents, they are too large for each document to be judged for each topic. Instead, a technique called pooling is used to produce a sample of the documents to be judged. For each topic, the pool consists of the union of the top 100 documents across all runs submitted to that TREC. Because different systems tend to retrieve some of the same documents in the top 100 retrieved, this process tends to produce pools of approximately 1500 documents. The documents in a pool are then viewed by a human and judged as to whether they are relevant to the topic. Once all the relevance judgments for all of the topics in the test set are complete, NIST evaluates the retrieval runs on the basis of the relevance judgments and returns the evaluation results to the participants. A TREC cycle ends with the workshop, which is a forum for participants to share their experiences.
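
To make the pooling step concrete, the following Python sketch shows how pools could be assembled from a set of submitted runs. It is only an illustration under stated assumptions, not NIST's actual tooling: the run files are assumed to be in the common TREC run format (topic, "Q0", document number, rank, score, run tag), and the pool depth of 100 matches the description above.

```python
from collections import defaultdict

POOL_DEPTH = 100  # documents taken from each run per topic, as described above

def read_run(path):
    """Parse a run assumed to be in the standard TREC run format:
    topic  Q0  docno  rank  score  tag  (whitespace separated)."""
    run = defaultdict(list)                      # topic -> [(rank, docno), ...]
    with open(path) as f:
        for line in f:
            topic, _q0, docno, rank, _score, _tag = line.split()
            run[topic].append((int(rank), docno))
    return run

def build_pools(run_paths, depth=POOL_DEPTH):
    """Pool = union of the top `depth` documents over all submitted runs."""
    pools = defaultdict(set)                     # topic -> set of docnos to judge
    for path in run_paths:
        for topic, ranked in read_run(path).items():
            for _rank, docno in sorted(ranked)[:depth]:
                pools[topic].add(docno)
    return pools

# Each pooled document is then judged by a human assessor; because runs
# retrieve many of the same documents, pools average roughly 1,500 documents.
```
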
The first two TRECs had two tasks, the ad hoc task and the routing task. The ad hoc task is the prototypical retrieval task, such as a researcher doing a literature search in a library. In this environment, the system knows the set of documents to be searched (the library's holdings) but cannot anticipate the particular topic that will be investigated. In contrast, the routing task assumes the topics are static but need to be matched to a stream of new documents. The routing task is similar to the task performed by a news-clipping service or a library's profiling system.

Starting in TREC-3, additional tasks, called tracks, were added to TREC. The tracks serve several purposes. First, tracks act as incubators for new research areas. The first running of a track often defines what the problem really is, and a track creates the necessary infrastructure, such as test collections and evaluation methodology, to support research on its task. The tracks also demonstrate the robustness of core retrieval technology in that the same techniques are frequently appropriate for a variety of tasks. Finally, the tracks make TREC attractive to a broader community by providing tasks that match the research interests of more groups.

As mentioned, the original catalyst for TREC was the request to create a large test collection, but that goal was broadened to standardizing and validating evaluation methodology for retrieval from realistically sized collections. A standard evaluation methodology allows results to be compared across systems – important not so there can be winners of retrieval competitions, but because it facilitates the consolidation of a wider variety of results than any one research group can tackle. TREC has succeeded in standardizing and validating the use of test collections as a research tool for ad hoc retrieval and has extended the use of test collections to other tasks. This section summarizes the support for this claim by examining three areas: the test collections, the trec_eval suite of evaluation measures and two experiments that confirm the reliability of comparing retrieval effectiveness using test collections.

Through the pooling process described above, TREC has created a set of test collections for the English ad hoc task. In the aggregate, the collections include five disks of documents, each containing approximately one gigabyte of English text (largely news articles but also including some government documents and some abstracts of scientific papers), and nine sets of 50 topics. Each topic has a set of manual relevance judgments for the corresponding document set. The collections are publicly available (see http://trec.nist.gov/data.html) and are now the collections of choice for most researchers working in basic retrieval technologies. The addition of tracks to TREC allowed the creation of test collections for other tasks as well. Collections have been made for languages other than English, media other than text and tasks that range from factoid question answering to text categorization. In each case the test collections have been integral to progress on the task.

Scoring the quality of a retrieval result given the system output from a test collection has been standardized by the trec_eval program written by Chris Buckley (see http://trec.nist.gov/trec_eval). Trec_eval provides a common implementation for over 100 different evaluation measures and ensures that issues such as interpolation are handled consistently. A much smaller set of measures has emerged as the de facto standard by which retrieval effectiveness is characterized. These measures include the recall-precision graph, mean average precision and precision at ten retrieved documents.
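
To illustrate how two of these standard measures are defined, here is a minimal Python sketch of average precision and precision at ten documents. It follows the textbook definitions rather than the trec_eval implementation: average precision averages the precision observed at the rank of each relevant document (relevant documents that are never retrieved contribute zero), and averaging that value over all topics gives mean average precision.

```python
def average_precision(ranked_docs, relevant):
    """Average of the precision values at the rank of each relevant document.
    Relevant documents that are never retrieved contribute zero."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i          # precision at this rank
    return total / len(relevant)

def precision_at(ranked_docs, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k

def mean_average_precision(results, qrels):
    """results: topic -> ranked list of docnos; qrels: topic -> set of relevant docnos."""
    return sum(average_precision(results[t], qrels.get(t, set()))
               for t in results) / len(results)
```
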
One objection to test collections that dates back to the Cranfield tests is the use of relevance judgments as the basis for evaluation. Relevance is known to be very idiosyncratic, and critics question how an evaluation methodology can be based on such an unstable foundation. An experiment using the TREC-4 and TREC-6 retrieval results investigated the effect of changing relevance assessors on system comparisons. The experiment demonstrated that the absolute scores for evaluation measures did change when different relevance assessors were used, but the relative scores between runs did not change. That is, if system A evaluated as better than system B using one set of judgments, then system A almost always evaluated as better than system B using a second set of judgments (the exception was the case where the two runs evaluated as so similar to one another that they should be deemed equivalent). The stable-comparisons result held for different evaluation measures and for different kinds of assessors, and was independent of whether a judgment was based on a single assessor's opinion or was the consensus opinion of a majority of assessors.

The use of pooling, where only some documents are judged for a topic and all unjudged documents are treated as not relevant, was another source of concern. Critics feared that runs that did not contribute to the pool would be unfairly penalized in the evaluation because those runs would contain highly ranked unjudged documents. Examination of larger pools did confirm one aspect of the critics' fears – there are unjudged documents remaining in the collections that would have been judged relevant had they made it into the pools. Further, the quality of the final test collection does depend on the diversity of the runs that contribute to the pools and the number of documents selected from each run. As an extreme example, pools created from only the top-ranking document from each of 30 runs do not form a good test collection. But tests showed that the TREC collections are not biased against unjudged runs. In these tests, the documents uniquely retrieved by a TREC run are treated as not relevant when that run is evaluated. The difference in the evaluation results for runs evaluated both with and without their own uniquely retrieved relevant documents was smaller than the difference produced by changing relevance assessors.

When TREC began there was real doubt as to whether the statistical systems that had been developed in the research labs (as opposed to the operational systems that used Boolean searches on manually indexed collections) could effectively retrieve documents from “large” collections. TREC has shown not only that the retrieval engines of the early 1990s did scale to large collections, but that those engines have improved since then. This effectiveness has been demonstrated both in the laboratory on TREC test collections and by today's operational systems that incorporate the techniques. Further, the techniques are routinely used on collections far larger than what was considered large in 1992. Web search engines are a prime example of the power of the statistical techniques. The ability of search engines to point users to the information they seek has been fundamental to the success of the Web.

Improvement in retrieval effectiveness cannot be determined simply by looking at TREC scores from year to year. It is invalid to compare the results from one year of TREC to the results of another year, since any differences are likely to be caused by the different test collections in use. However, developers of the SMART retrieval system kept a frozen copy of the system they used to participate in each of the eight TREC ad hoc tasks. After every TREC, they ran each system on each test collection. For every test collection, the later versions of the SMART system were much more effective than the earlier versions, with the later scores approximately twice the earlier scores (see Figure 1). While these scores are evidence for only one system, the SMART system results consistently tracked with the other systems' results in each TREC, and thus the SMART results can be considered representative of the state of the art. The improvement was evident for all evaluation scores that were examined, including mean average precision and precision and recall at various cut-off levels.

Figure 1. Mean average precision for the SMART system, by year and task.

Among the techniques that the TREC experiments showed to be effective are the following:

- Automatic query expansion through blind (pseudo-)relevance feedback (a sketch of the procedure follows this list): retrieve a first set of documents using the original query; assume the first X documents are relevant; perform relevance feedback with that set of documents to create a new query (usually including both new terms and refined query weights); and return the results of searching with the new query to the user.
- Tokenization that regularizes word forms is generally helpful. The most common form of regularization is stemming, but normalizing proper nouns to a standard format can also be helpful.
- Simple phrasing techniques are generally helpful. The most helpful part of phrasing is the identification of common collocations that are then treated as a single unit. More elaborate schemes have shown little benefit.
- Appropriate weighting of terms is critical. The best weighting schemes reflect the discrimination power of a term in the corpus and control for document length. There are several different weighting schemes that achieve these goals and are equivalently effective. Language modeling techniques are not only effective but also provide a theoretical justification for the weights assigned.
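
As a concrete rendering of the blind feedback procedure in the first item above, and of idf-style term scoring that reflects a term's discrimination power in the corpus, here is a minimal Python sketch. The tokenizer, the choice of the top 10 documents and 20 expansion terms, and the `search` function supplied by the caller are all illustrative assumptions, not settings prescribed by any particular TREC system.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def expand_query(query, search, collection, n_docs=10, n_terms=20):
    """Blind (pseudo-)relevance feedback:
    1. retrieve with the original query,
    2. assume the top n_docs documents are relevant,
    3. add the most distinctive terms from those documents to the query.
    `search` is assumed to map a query string to a ranked list of doc ids;
    `collection` is assumed to map doc id -> document text."""
    top_ids = search(query)[:n_docs]

    # idf reflects a term's discrimination power in the corpus
    n = len(collection)
    df = Counter()
    for text in collection.values():
        df.update(set(tokenize(text)))

    scores = Counter()
    for doc_id in top_ids:
        for term, tf in Counter(tokenize(collection[doc_id])).items():
            scores[term] += tf * math.log(n / (1 + df[term]))

    expansion = [t for t, _ in scores.most_common(n_terms)
                 if t not in tokenize(query)]
    return query + " " + " ".join(expansion)

# The expanded query is then run against the collection and the final
# ranking is returned to the user.
```

In practice, as the list item notes, feedback systems usually also refine the weights of both original and expansion terms rather than simply appending new terms, but the overall structure of the procedure is the same.
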
The two gigabytes of text that was considered massive in 1992 is modest when compared to the amount of text some commercial retrieval systems search today. While TREC has some collections that are somewhat bigger than two gigabytes – the Web track used an 18-gigabyte extract of the Web, for example – there is once again doubt whether research retrieval systems and the test collection methodology can scale to collections another three orders of magnitude larger. The terabyte track, introduced in TREC 2004, was initiated to examine these questions.

The TREC track structure enables TREC to extend the test collection paradigm to new tasks. Several of the TREC tracks have been the first large-scale evaluations in their area. In these cases, the track has established a research community and created the first specialized test collections to support the research area. A few times, a track has spun off from TREC and established its own evaluation conference. The Cross-Language Evaluation Forum (CLEF, see http://clef.iei.pi.cnr.it) and the TRECVid workshops (http://www.itl.nist.gov/iaui/894.02/projects/trecvid/) are examples of this. Other conferences such as NTCIR (http://research.nii.ac.jp/ntcir) and the INitiative for the Evaluation of XML Retrieval (INEX, http://inex.is.informatik.uni-duisburg.de:2004) were not direct spin-offs from TREC, but were inspired by TREC and extend the methodology to still other areas.

The set of tracks run in any particular TREC depends on the interests of the participants and sponsors, as well as on the suitability of the problem to the TREC environment. The decision of which tracks to include is made by the TREC program committee, a group of academic, industrial and government researchers who have responsibility for oversight of TREC. Tracks are discontinued when the goals of the track are met, or when there are diminishing returns on what can be learned about the area in TREC. Some tracks run for many years but change focus in different years. Figure 2 shows the set of tracks that were run in the different years of TREC and groups the tracks by the aspects that differentiate them from one another. The aspects listed on the left of the figure show the breadth of the problems that TREC has addressed, while the individual tracks listed on the right show the progression of tasks within the given problem area.

Figure 2. TREC tracks by year.

Space limitations prohibit going into the details of all of these tracks, and the interested reader is referred to the track overview papers in the TREC proceedings (available at http://trec.nist.gov/pubs.html). Instead, a few tracks that established new research communities are highlighted.
Cross-Language Retrieval. One of the first tracks to be introduced into TREC was the Spanish track. The task in the Spanish track was a basic ad hoc retrieval task, except the topics and documents were written in Spanish rather than English. The track was discontinued when the results demonstrated that retrieval systems could retrieve Spanish documents as effectively as English documents. Another single-language track, this time using Chinese as a language with a structure very different from that of English, was introduced next. Again, systems were able to effectively retrieve Chinese documents using Chinese topics.

There are a variety of naturally occurring document collections, such as the Web, that contain documents written in different languages that a user would like to search using a single query. A cross-language retrieval system uses topics written in one language to retrieve documents written in one of a variety of languages. The first cross-language track was introduced in TREC-6 to address this problem. The TREC-6 track used a document collection consisting of the French and German documents from the Swiss news agency Schweizerische Depeschen Agentur plus English documents from the Associated Press from the same time period. Topics were generated in English and then translated into French, German, Spanish and Dutch. Participants searched for documents in one target language using topics written in a different language. Later versions of the track had participants search for documents using topics in one language against the entire combined document collection. Still later versions of the track used more disparate languages (English topics against either Chinese or Arabic document sets).

The TREC cross-language tracks built the first large-scale test collections to support cross-language retrieval research and helped establish the cross-language retrieval research community. The track demonstrated that cross-language retrieval can be more effective than the corresponding monolingual retrieval due to the expansion that results from translating the query. TREC no longer contains tasks involving languages other than English because there are now other venues for this research. The NTCIR and CLEF evaluations mentioned earlier offer a range of retrieval tasks with a multilingual focus.

Spoken Document Retrieval. The dual goals of the spoken document retrieval track were to foster research on content-based access to recordings of speech and to bring together the speech recognition and retrieval communities. The track ran for four years, from TREC-6 through TREC-9, and explored the feasibility of retrieving speech documents by using the output of an automatic speech recognizer. The documents used in the track were stories from audio news broadcasts that were manually segmented into their component stories. Several other forms of the content, including manual transcripts and transcripts produced by a baseline recognizer, were also made available to track participants. The different versions of the broadcasts made it possible for participants to explore the effect of varying amounts of errors in the text – from (assumed to be) no errors for the manual transcripts through varying degrees of recognition errors associated with the baseline and participants' recognizer transcripts – on retrieval performance.
Over the course of the track, researchers developed systems whose retrieval effectiveness on automatically recognized transcripts was comparable to their effectiveness on the human-produced reference transcripts, and demonstrated that their technology is robust across a wide range of recognition accuracy. The worst effect of automatic recognition came from out-of-vocabulary (OOV) words. Participants compensated for OOV words by using adaptive language models to limit the number of OOV words encountered and by expanding the recognized text with related clean texts to include OOV words in the documents. The track also contributed to the development of techniques for near-real-time recognition of open-vocabulary speech under a variety of non-ideal conditions including spontaneous speech, non-native speakers and background noise.

Video Retrieval. After the success of the spoken document retrieval track, TREC introduced a video track to foster research on content-based access to digital video recordings. A document in the track is defined as a video shot. Tasks have included shot boundary detection, feature detection (where features are high-level semantic constructs such as “people running” or “fire”) and an ad hoc search task where the topic is expressed as a textual information need, possibly including a still or video image as an example. The test set of videos has been derived from a variety of sources including broadcast news, training videos and recordings of scientific talks. Because there is no obvious analog to words for video retrieval, there has been little overlap in the techniques used to retrieve text documents versus video documents. Yet interest in the problem of video retrieval is high and increasing among researchers, content providers and potential users of the technology. To allow greater room for expansion than would be possible as a TREC track, the video track was spun off from TREC as the separate TRECVid workshop in 2003. TRECVid continues to date, with approximately 60 participating groups in TRECVid 2004.

Question Answering. While a list of on-topic documents is undoubtedly useful, even that can be more information than a user wants to examine. The TREC question answering track was introduced in 1999 to focus attention on the problem of returning exactly the answer in response to a question. The initial question answering tracks focused on factoid questions such as “Where is the Taj Mahal?” Later tracks have incorporated more difficult question types, such as list questions (a question whose answer is a distinct set of instances of the type requested, such as “What actors have played Tevye in Fiddler on the Roof?”) and definition/biographical questions (such as “What is a golden parachute?” or “Who is Vlad the Impaler?”).

The question answering track was the first large-scale evaluation of open-domain question answering systems, and it has brought the benefits of test collection evaluation observed in other parts of TREC to bear on the question answering task. The track established a common task for the retrieval and natural language processing research communities, creating a renaissance in question answering research. This wave of research has produced significant progress in automatic natural language understanding as researchers have successfully incorporated sophisticated language processing into their question answering systems. Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer.
TREC has been able to build on the text retrieval field's tradition of experimentation to significantly improve retrieval effectiveness and to extend the experimentation to new sub-problems. By defining a common set of tasks, TREC focuses retrieval research on problems that have a significant impact throughout the community. The conference itself provides a forum in which researchers can efficiently learn from one another and thus facilitates technology transfer. TREC also provides a forum in which methodological issues can be raised and discussed, resulting in improved text retrieval research. More information regarding TREC can be found on the TREC website – http://trec.nist.gov – and in the book TREC: Experiment and Evaluation in Information Retrieval, recently published by MIT Press.
