Automatic semantic mapping between query terms and controlled vocabulary through using WordNet and Wikipedia
2008; Wiley; Volume: 45; Issue: 1; Language: English
10.1002/meet.2008.1450450286
ISSN 0044-7870
Authors: Xiaozhong Liu, Jian Qin, Miao Chen, Ji‐Hong Park
Topic(s): Wikis in Education and Collaboration
Abstract: Query log analysis can provide valuable information for improving information retrieval performance. This paper reports findings from a query log mining project in which query terms falling in the very long tail of low to zero similarity scores (with the controlled vocabulary) were analyzed using similarity algorithms. The query log data was collected from the Gateway to Educational Materials (GEM). The limited number of terms in the GEM controlled vocabulary was a major source of the long tail of low or zero similarity scores for the query terms. To mitigate this limitation, we employed a strategy that used the general-purpose (domain-independent) ontology WordNet and the community-created Wikipedia as bridges to establish semantic relatedness between the GEM controlled vocabulary (as well as new concept classes identified by human experts) and user query terms. The two sources, WordNet and Wikipedia, were complementary in mapping different types of query terms, and a combination of both achieved a modest rate of mapping accuracy. The paper discusses the implications of the findings for automatic semantic analysis and for vocabulary development and validation.

Query logs record what users entered into the system when searching for information. The information captured can be useful for studying user search behavior when combined with session data. Numerous publications have reported analysis results from the perspectives of search behavior, interactions, cognitive processes, algorithm enhancements, and clustering, which have been reviewed by Jansen and Spink (2006). The fact that query terms represent users' information needs makes them a valuable source not only for studying user search behavior, but also for enriching or validating an existing controlled vocabulary with new terms and topics. This paper focuses on analyzing the relationships between user query terms and controlled vocabulary in three steps: 1) preprocessing query log data from the Gateway to Educational Materials (GEM), 2) processing query terms using WordNet and Wikipedia, and 3) using different similarity algorithms to automatically mine the semantic relationships between query terms and the controlled vocabulary/new knowledge classes. The following sections first review relevant research and then describe the methodology and experimental results. Implications of the research findings are discussed in the conclusion section.

Log analysis has many variants depending on the type of log data used and the purpose of analysis. Access logs are commonly used for evaluating access to and usage of Web pages and for studying user behavior. By mining sequences and statuses of access to Web pages, researchers can discover navigation patterns and generalize concept hierarchies (Spiliopoulou, 2000). Transaction logs originally referred to the history of actions executed by database management systems, but the term has also been used for query logs generated in information search systems (Jones et al., 2000; Jansen & Spink, 2006). The term "query logs" also appears frequently in the literature and is used interchangeably with "user logs" and the two terms mentioned above. Query logs keep information on what query terms a user entered, under what query options a query was executed, and how many results were returned to the user. Query log data may be associated with users' authentication information if login is required, and with Web server log data to obtain access sequence information about queries.
Although the data in these logs vary in format and detail, they are alike in that all of them record users' interactions with the system and keep information about what actions users took to access and use the system. Query log analysis has been used as a strategy to improve the effectiveness of information retrieval because the log data reflect user search trends and predict future human interactions with search engines (Shokouhi, 2006). Researchers have studied frequencies of log data components such as the number of terms per query and the frequency distribution of query terms (Jansen et al., 2000; Spink et al., 2001), or correlations among query terms and pairs of query terms (Silverstein et al., 1999). Statistical descriptions of query log data have been used to study the features of user search behavior on commercial Web search engines (Silverstein et al., 1999; Jansen et al., 1998). The findings from these studies show that typical search behavior today includes shorter queries than those used in traditional information retrieval systems and impatient result browsing (i.e., users rarely go beyond the first page when checking search results).

Query clustering and classification are often used in information retrieval to categorize queries in order to return relevant information to searchers. Kim and Seo (2005) developed an FAQ system using query log clustering. Their experiment showed that both precision and recall outperformed the traditional system, demonstrating that query log clustering was a useful bridge between queries and FAQs (Kim & Seo, 2005). Query classification is another application of query log analysis in information retrieval. It is usually done by applying either supervised learning (Beitzel et al., 2005) or unsupervised learning (Wen et al., 2002; Beeferman, 2000) techniques. The supervised method refers to machine learning techniques such as Support Vector Machines and decision trees that are used together with human tagging as training data. In contrast, unsupervised learning computes similarity scores between queries without human judgment; the most frequently used algorithms include K-means and the expectation-maximization (EM) algorithm.

Approaches to measuring semantic similarity fall into three groups:
1) Lexical-based approach (Leacock & Chodorow, 1998; Wu & Palmer, 1994): measures similarity based on path length and depth in a lexical resource. The closer two concept nodes are in the taxonomy, the higher their degree of similarity. WordNet is frequently used by this approach (see the minimal illustration following this list).
2) Corpus-based approach (Landauer & Dumais, 1997): measures similarity based on probabilities or entropy calculated from a corpus; performance depends on the corpus chosen. Some studies, such as Rapp (2002), used syntagmatic or paradigmatic relationships to calculate semantic similarity, which is also corpus-driven.
3) Hybrid approach (Lin, 1998; Resnik, 1995; Jiang & Conrath, 1997): uses both lexical and corpus resources to calculate similarity scores.

Although query logs have been used to study user behavior and have served as the source for clustering and classification in information retrieval, most of them came from commercial Internet search engines, and the query terms covered a wide range of subjects in which controlled vocabulary was rarely involved.
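As referenced in the list above, here is a minimal illustration of the lexical-based approach using NLTK's WordNet interface. NLTK is our tool choice for illustration only and is not mentioned in the cited studies:

```python
# Setup (one-time): pip install nltk; then in Python: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

# Path-based: inverse of the shortest IS-A path length between the nodes
print(cat.path_similarity(dog))   # 0.2

# Leacock-Chodorow: path length normalized by the taxonomy depth
print(cat.lch_similarity(dog))    # ~2.03

# Wu-Palmer: based on the depth of the lowest common subsumer
print(cat.wup_similarity(dog))    # ~0.86
```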
While users of Internet search engines are generally not concerned about controlled vocabulary, the usefulness and effectiveness of controlled vocabulary in information retrieval has been proven in specialized search systems such as the Unified Medical Language System (UMLS) (Aronson et al., 1997). Most digital libraries built for educational purposes offer a search option using controlled vocabulary; however, keyword or full-text search has been found to be dominant (Qin & Prado, 2005). While many factors could have contributed to this search behavior, the lack of semantic relations between the controlled vocabulary and keywords puts users at a disadvantage, because general users are not trained to think about the exact relationships (e.g., synonymy and polysemy) between the terms they use for searching and the controlled vocabulary (Dumais et al., 1990).

The project reported in this paper used query log data from an educational digital library to mine semantic relationships between user query terms and the controlled vocabulary. We used a revised hybrid approach for calculating similarity scores that involved general (domain-independent) ontological resources to process query terms and match the results with the controlled vocabulary. The performance of the matching was evaluated using a community-created resource (Wikipedia) and an expert-created resource (WordNet). Search engines in educational digital libraries often use controlled vocabulary as one of the search options. Unfortunately, the lack of robust vocabularies and the heavy reliance on keywords in user searches caused a large number of searches to return zero hits, or far too few or too many hits (Qin & Prado, 2005). The research reported in this paper is an experiment testing the performance of three approaches to mining user query terms for enriching the controlled vocabulary. Will a general-purpose ontology or knowledge repository be able to provide convincing coverage of the educational knowledge domain? Will general ontological entity relationships be able to produce high-quality matches between query terms and controlled vocabulary in the educational domain?

GEM, an educational digital library, provides access to Internet-based lesson plans, instructional units, and other educational materials. In fall 2003, by courtesy of the Gateway to Educational Materials (GEM), we collected query logs from a four-month period (February, March, April, and August) that generated 411,898 queries. We then wrote SQL programs to parse the queries and obtain a master list of query terms. The master list contained 1,044,043 query terms, including grade numbers and/or terms, topical keywords, book/movie titles, names of persons and organizations, geographical names, chronological expressions, and many other categories. We calculated the similarity scores between query terms and those in the ERIC Thesaurus by using the built-in fuzzy grouping function in Microsoft SQL Server 2005. A descriptive analysis of the query terms was reported in Qin and Prado (2005).

We used two sources for processing the query terms. One is WordNet, the most frequently used structured lexical ontology, developed by Princeton University (Miller, 1995). Unlike conventional thesauri, WordNet is indexed by senses and contains various kinds of semantic relations such as synonymy, antonymy, hyponymy, and meronymy.
The relatedness between word senses is critically important for identifying similarities (Richardson et al., 1994) and will be elaborated in later sections. Another source used for processing the query components is Wikipedia (www.wikipedia.org), a user-generated knowledge repository. Each article in Wikipedia represents a concept (mainly named entities and domain-specific terms) and is indexed by several subject categories. The fact that all of its content is created and maintained by users from all over the world makes it a great resource for our purpose. WordNet is suitable for processing linguistic relationships for single words only; Wikipedia complements WordNet's limitations by offering up-to-date information on current events, news, and concept changes. Compared to other encyclopedias (e.g., Encyclopedia Britannica, http://www.britannica.com), which are strictly controlled by a professional editorial team, Wikipedia has better coverage and comparable quality (Giles, 2005).

A direct method for matching queries against the controlled vocabulary (CV) is string (fuzzy) matching, namely calculating the edit distance between a query (including stop words) and a CV term, based on the number of edit operations such as deletion, insertion, and substitution of substrings. Similar approaches have been proposed by Gusfield (1997). We measured the overlap with the Dice Coefficient,

Dice(Q, CV) = 2 × SN(Q ∩ CV) / (SN(Q) + SN(CV)),

where SN(Q ∩ CV) denotes the number of common substrings shared by the query term and the CV term, while SN(Q) and SN(CV) denote the number of substrings within the query term and the CV term, respectively. We used bi-gram substrings as the test unit. This method is simple yet has a distinctive advantage: it takes string order into account, which differs from the classical vector space model (Salton et al., 1975). For example, the existing CV term "team teaching" is different from "teaching team" even though both contain the same tokens. In other words, a high Dice Coefficient score (e.g., > 0.85) implies a high probability that the two strings are similar in both linguistic form and semantics, especially when the strings are long (based on our experiment and human judgment).

The Dice Coefficient, however, can be limited when there is a very long tail of low scores. The small number of controlled terms used in our experiment (364 terms) was dwarfed by the large number of unique query terms (87,295). The Dice Coefficient score distribution in Figure 1 shows that, among the 87,295 unique query terms, scores below 0.5 accounted for more than 90% of the total. Perfect and near-perfect matches between query components and controlled vocabulary (Dice Coefficient > 0.85) accounted for less than 4% (2.54% + 0.14% + 0.51%). Compared to the long-tail percentage (90.56%, Dice Coefficient < 0.50), the 4% of "good matches" can be considered somewhat trivial. Another problem with the Dice Coefficient method is that a high similarity score does not necessarily mean a good match. For example, the similarity score between "observation" and "reservation" is 0.81, yet there is almost no semantic relationship between the two words, although only a few cases of this kind were identified. To identify semantic relationships (similarity) between query terms and controlled vocabulary, we introduced two sources to process the long-tailed query terms in the dataset.
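The bi-gram Dice Coefficient described above can be sketched as follows. This is our own illustrative implementation; the paper used the fuzzy grouping function of SQL Server 2005, so the exact tokenization (handling of spaces, case, and word boundaries) may differ:

```python
from collections import Counter

def bigrams(s: str) -> list[str]:
    """Character bi-grams of a lowercased string."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(q: str, cv: str) -> float:
    """2 * SN(Q ∩ CV) / (SN(Q) + SN(CV)) over character bi-grams."""
    a, b = Counter(bigrams(q)), Counter(bigrams(cv))
    common = sum((a & b).values())  # shared bi-grams, counted with multiplicity
    return 2 * common / (len(bigrams(q)) + len(bigrams(cv)))

# String order matters: same tokens, different bi-grams, score below 1.0
print(dice("team teaching", "teaching team"))  # ~0.92

# High score without semantic relatedness; this sketch yields 0.80,
# close to the paper's reported 0.81 (tokenization may differ)
print(dice("observation", "reservation"))
```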
As a large lexical-level ontological database, WordNet is frequently used by various Natural Language Processing (NLP), Information Retrieval (IR), and text mining systems. WordNet can provide accurate definitions of word senses and diverse relationships between English words. To what extent does WordNet cover query terms in the educational domain? How well does WordNet perform in identifying the relatedness between query and controlled vocabulary strings in the education domain? In this experiment, we randomly selected 659 queries from the log data for evaluation purposes and manually mapped controlled vocabulary terms to WordNet words (as CV classes), such as "Art," "Sport," and "History." For example, WordNet lists the following senses for "hamlet":

hamlet, crossroads – (a community of people smaller than a village)
Hamlet – (the hero of William Shakespeare's tragedy who hoped to avenge the murder of his father)
village, hamlet – (a settlement smaller than a town)

There is no mention in the above entries that "Hamlet" is also a film. A similar example is "Vista," which WordNet defines as "view, aspect, prospect, scene, vista, panorama," none of which is associated with software or an operating system. The lack of proper nouns and up-to-date definitions in WordNet makes it challenging to address the second question mentioned above.

The data in Figure 2 show that, among the 659 query terms, we were able to find exact matches in WordNet for 103 single-word query terms, and an additional 72 after applying the Porter stemmer. The test demonstrated that WordNet covered only about a quarter of the query terms (175, or 26.56%). This result is consistent with the testing result for the large collection of 87,295 query terms.

To identify the relatedness between query terms and the CV, we calculated the semantic relationships between query terms and the CV by analyzing the paths provided by WordNet. The semantic similarity between words was represented by the hierarchically structured lexical information in WordNet, namely IS-A (hypernymy/hyponymy) relations (Warin, 2004). Most taxonomy-based semantic similarity algorithms are evaluated in terms of the distance between the words or phrases (nodes in the tree structure) in the taxonomy: the shorter the path from one node to another, the more similar they are. In the case of multiple paths, the shortest path represents the strongest relationship. In the WordNet taxonomy, it has been noted that the semantic "distance" covered by individual taxonomic links is variable, because certain sub-taxonomies are much denser than others. This intrinsic problem of WordNet can be minimized by using the maximum depth of the taxonomy (from the lowest node to the top) in which both words occur to normalize the similarity score. (Figure: An example demonstrating the lowest common subsumer (LCS).)

From the similarity scores calculated for all query terms against the candidate controlled vocabulary classes (provided by human experts), we found that the good matches with a high precision rate were those query terms identified in WordNet. Figure 3 presents the experimental results at three similarity levels. The percentages of correct matches suggest that the semantic relationships mapped between query terms and CV classes are reliable as long as the position of the target query term can be identified in the WordNet tree. When similarity scores were greater than 0.3, the precision rate of correct matches increased to 98.46%, a near-perfect match.
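The depth-normalized, path-based similarity described above can be sketched in the style of Leacock and Chodorow (1998), again using NLTK's WordNet interface as an illustrative stand-in for the paper's implementation; the maximum-depth constant is our assumption:

```python
import math
from nltk.corpus import wordnet as wn

def normalized_path_similarity(s1, s2, max_depth=20):
    """-log((path + 1) / (2 * D)): shortest IS-A path length normalized by
    the taxonomy depth D, following Leacock & Chodorow (1998). A max_depth
    of roughly 20 for the WordNet noun taxonomy is an assumption here."""
    path = s1.shortest_path_distance(s2)
    if path is None:
        return None  # the synsets share no IS-A path
    return -math.log((path + 1) / (2.0 * max_depth))

hamlet = wn.synsets('hamlet', pos=wn.NOUN)[0]  # first noun sense of "hamlet"
village = wn.synset('village.n.01')
print(hamlet.lowest_common_hypernyms(village))  # the LCS in the IS-A tree
print(normalized_path_similarity(hamlet, village))
```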
Having carefully examined the incorrect matches, we found that some popular word senses were absent from WordNet, as mentioned earlier in this paper. The incorrect match example "Adobe" in Figure 3 (similarity score = 0.295) was a result of word sense presence/absence in WordNet: the available definition of adobe is "a kind of clay or brick," which led to "construction," while the sense of Adobe as a proper name is absent. Overall, the results (Figure 3) from semantic path based matching were encouraging and showed a better precision rate than the WordNet coverage test outcome (Figure 2). This demonstrates that even though only a portion of educational query terms can be found, the semantic matches provided by the semantic relationships in WordNet show a high level of accuracy.

The results from the last section demonstrate that, although WordNet was able to provide accurate matches for the query terms, there was a low rate of recall and coverage (only 26.56% of query terms matched WordNet equivalents) and some important word senses were absent from WordNet. To increase the lexical and semantic coverage of query terms, additional sources were needed to compensate for the limitations of WordNet. The fact that our data was collected from educational digital library query logs provides two advantages: first, the queries are concentrated in the academic/educational domain; second, they contain a large number of proper nouns. A user-created knowledge repository such as Wikipedia would satisfy the need for domain-specific terms and proper names and provide closer and fuller matches for the query terms. Figure 4 compares some major characteristics of Wikipedia and WordNet.

Similar to the WordNet approach, we searched Wikipedia for the 484 query terms that did not match any words in WordNet. Most query terms found to have matches in WordNet were also located in Wikipedia. We were able to locate an additional 194 query terms in Wikipedia, most of which were proper nouns, e.g., "three billy goats gruff" (a Norwegian fairy tale). Furthermore, we partially matched 168 query terms in Wikipedia, e.g., "1950's rock and roll" matched the "rock and roll" entry with "1950's" as the modifier. The experiment (Figure 5) produced an encouraging result: 194 + 168 = 362 additional query terms were found in Wikipedia, which increased the coverage of the 659 query terms by 29.44% + 25.49% = 54.93%. The result supported our hypothesis that a general-purpose knowledge repository such as Wikipedia can cover most query terms in the educational domain. Two examples illustrate the category tags attached to Wikipedia articles:

Proper noun: 1001 Arabian Nights
Wikipedia categories: Persian mythology; Arabian mythology; Controversial literature; Motif of harmful sensation; Epics; Pederastic literature; The Book of One Thousand and One Nights; Arabic literature; Persian literature; Indian literature

Proper noun: 1906 earthquake
Wikipedia categories: History of the San Francisco Bay Area; 1906 San Francisco earthquake

The analysis showed that the query terms had 5.66 categories (semantic tags) on average, and most categories were phrases rather than single words. However, because of its uncontrolled nature, Wikipedia cannot present reliable relationships between categories the way WordNet does. Even though Wikipedia has high semantic coverage for our data, we cannot calculate similarity scores simply from its category hierarchy.
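Category tags like those above can be retrieved programmatically. The sketch below uses the present-day MediaWiki web API, which postdates the 2008 study; it is our assumption about how this step could be reproduced, not the authors' method:

```python
import requests

def wikipedia_categories(title: str) -> list[str]:
    """Fetch the visible category tags of an English Wikipedia article."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "categories",
            "clshow": "!hidden",  # skip hidden maintenance categories
            "cllimit": "max",
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return [
        c["title"].removeprefix("Category:")
        for page in pages.values()
        for c in page.get("categories", [])
    ]

print(wikipedia_categories("One Thousand and One Nights"))
```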
Instead, we used a voting procedure over an article's category tags (a code sketch of this procedure is given at the end of this section):

1) Search WordNet for the first category/semantic tag, recorded as the WN-category.
2) If the category phrase does not exist in WordNet, remove the modifier and search again.
3) Calculate the similarity between (WN-category, CV-class) using the Lin algorithm.
4) Select the highest-scoring CV-class; the category "votes" for that class.
5) Repeat steps 1-4 so that every category "votes" for its closest CV-class.
6) The CV-class with the most votes is the result.

For example, there were 10 categories for the query "1001 Arabian Nights" (listed above). Using the vote algorithm, each category voted for its own CV class, and the result was "Literature" with 5 votes, "Music" with 2, "Politics" with 2, and "Software" with 1. Therefore, the best choice of CV class for the query "1001 Arabian Nights" is "Literature." Since the WordNet method had already achieved a high precision, we focused only on the 362 cases that did not have a match in WordNet. The percentages of correct and incorrect matches calculated using the vote algorithm are shown in Figure 6. This matching result was not as good as what WordNet achieved, but it is still interesting because Wikipedia was able to compensate for the weaknesses of WordNet. Figure 6 shows that Wikipedia achieved more than one-third correct matches among the query terms not found in WordNet.

After studying the incorrect matches, we found that one of the reasons was that semantic tags tended to change even for the same topic; for example, "Olympic games" appears with different years as modifiers. The varied category names for the same topic demonstrate that there are no rules or guidelines for tagging Wikipedia articles: any terms or phrase structures may be used by anyone to describe the same or different concepts. To further increase the rate of high-quality semantic matching, we would need to employ NLP techniques to process the article content.

The educational digital library users tended to use domain-specific nouns and proper nouns with various kinds of modifiers in their queries, which made the data cleaner than those from general-purpose commercial search engines. When the controlled vocabulary had only a limited number of terms to cover the domain knowledge, the edit distance-based match was not efficient because it produced a very long tail with a low accuracy rate (more than 96% with Dice Coefficient < 0.80). It is possible to use a domain-independent (expert-generated) ontology as the connecting bridge to match query terms against the CV: our experiment with WordNet demonstrates that, even though the coverage rate for the query terms in our dataset was low (26.56%), the precision rate could be reasonably high (83.33% to 98.46%, depending on the threshold). User-generated knowledge repositories such as Wikipedia can also bridge the matching between query terms and CV terms: our experiment with Wikipedia suggested a high combined coverage of educational user queries (81.49% available in WordNet or Wikipedia), and by using a vote algorithm based on user-provided subject categories, we achieved a modestly accurate match result (35.64%). Studies of this kind need to consider issues in both query log mining and controlled vocabulary. The balance point between accuracy and coverage in this research slightly favored the latter. To improve the matching rate, further studies with natural language processing and machine learning algorithms are needed.
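As referenced above, here is a minimal sketch of the category-voting procedure. The CV classes, the category-to-synset lookup, and the use of NLTK's Lin similarity with the Brown information-content file are our illustrative assumptions; the paper does not specify these implementation details:

```python
# Setup (one-time): nltk.download('wordnet'); nltk.download('wordnet_ic')
from collections import Counter
from nltk.corpus import wordnet as wn, wordnet_ic

ic = wordnet_ic.ic('ic-brown.dat')  # information content for Lin similarity

# Hypothetical CV classes, mapped by hand to WordNet noun synsets
CV_CLASSES = {
    "Literature": wn.synset('literature.n.01'),
    "Music": wn.synset('music.n.01'),
    "Politics": wn.synset('politics.n.01'),
    "Software": wn.synset('software.n.01'),
}

def first_synset(phrase):
    """Look up a category phrase in WordNet; if it is absent,
    remove the leading modifier and re-search (steps 1-2 above)."""
    words = phrase.lower().split()
    while words:
        synsets = wn.synsets('_'.join(words), pos=wn.NOUN)
        if synsets:
            return synsets[0]
        words = words[1:]
    return None

def vote(categories):
    """Each category votes for its most Lin-similar CV class (steps 3-6)."""
    votes = Counter()
    for category in categories:
        syn = first_synset(category)
        if syn is None:
            continue
        best = max(CV_CLASSES,
                   key=lambda c: CV_CLASSES[c].lin_similarity(syn, ic))
        votes[best] += 1
    return votes.most_common(1)[0][0] if votes else None

# Three of the category tags of "1001 Arabian Nights" listed earlier
print(vote(["Persian mythology", "Epics", "Arabic literature"]))
```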
This study provides evidence supporting the use of general-purpose ontologies and user-generated knowledge repositories to identify semantic relationships between query terms and a domain-specific CV in the absence of powerful controlled vocabularies. The overall approach is summarized in a figure captioned "A methodology of using domain-independent ontologies to process domain-specific query terms."