In this issue
2007; Wiley; Volume 59, Issue 1; Language: English
DOI: 10.1002/asi.20795
ISSN: 1532-2890
In this issue, Wolfram and Zhang demonstrate the influence of different indexing characteristics on document space density changes and document space discriminative capacity for information retrieval. Document environments that contain a relatively higher percentage of infrequently occurring terms produce lower-density outcomes than do environments where a higher percentage of frequently occurring terms exists. Different indexing exhaustivity levels, however, have little influence on document space densities. A weighting algorithm that favors higher weights for infrequently occurring terms results in the lowest overall document space densities, which allows documents to be more readily differentiated from one another. The authors also discuss how two methods of normalizing term weights (i.e., by means and by ranges) influence outcomes for the different weighting methods.

Simeoni et al. propose an approach to content-based distributed information retrieval based on the periodic and incremental centralization of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative's (OAI) Protocol for Metadata Harvesting, the approach occupies a middle ground between content crawling and distributed retrieval. As in crawling, some data move toward the retrieval process, but they are statistics about the content rather than the content itself; this results in more efficient use of network resources and a wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision while promoting the simplicity, effectiveness, and responsiveness of retrieval. The authors argue that, overall, this approach combines the desirable properties of centralized retrieval with cost-effective, large-scale resource pooling.
The requirements associated with the approach are discussed, and two strategies for deploying it on top of the OAI infrastructure are identified. In particular, the authors define a minimal extension of the OAI protocol that supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. They report on the implementation of a proof-of-concept prototype service for multimodal content-based retrieval of distributed file collections.

Tennis and Sutton describe the development of an extension to the Simple Knowledge Organization System (SKOS) to accommodate the needs of vocabulary development applications (VDA) that manage metadata schemes and require close tracking of change to both those schemes and their member concepts. The authors take a neopragmatic epistemic stance in asserting the need for an entity in SKOS modeling to mediate between the abstract concept and the concrete scheme. While the SKOS model sufficiently describes entities for modeling the current state of a scheme in support of indexing and search on the Semantic Web, it lacks the expressive power to serve the needs of VDA, which must maintain a scheme's historical continuity. The authors demonstrate preliminarily that conceptualizations drawn from empirical work in modeling entities in the bibliographic universe, such as works, texts, and exemplars, can provide the basis for SKOS extensions that support the more rigorous demands of capturing concept evolution in VDA.

Thelwall examines whether it is possible to obtain complete lists of matching URLs from Windows Live, and whether any of its hit count estimates are robust. The article introduces two new methods for extracting extra URLs from search engines: automated query splitting and automated domain and TLD searching.
Both methods successfully identify additional matching URLs, but the findings suggest that there is no way to obtain complete lists of matching URLs or accurate hit counts from Windows Live. Some estimating suggestions are provided.

Elkiss et al. describe an investigation into the ways that different authors of scientific papers describe a piece of related prior work, observing in particular that different citations to the same paper often focus on different aspects of that paper. The authors studied citation summaries in the context of research papers in the biomedical domain. A citation summary is the set of citing sentences for a given article and can be used as a surrogate for the actual article in a variety of scenarios. The study shows that citation summaries overlap to some extent with the abstracts of the papers but differ from them in that they focus on different aspects of the papers than the abstracts do. In addition, co-cited articles (pairs of articles cited together by another article) tend to be similar. The authors present results based on a lexical similarity metric called cohesion to support these claims.

Siau and Katerattanakul present the development of an instrument to measure the factors affecting the information quality of personal Web portfolios. A questionnaire was administered to 307 participants from five information systems classes at two universities. Participants were asked to rate 20 concepts (for example, length of Web page, eye-catching graphics, contact information) on a scale from 0 (Not Important) to 6 (Extremely Important). The results were used to develop the evaluative instrument.

Leydesdorff explores the use of similarity measures for the normalization of author co-citation analysis. He argues that this general area of debate is further complicated when one distinguishes between the symmetrical co-citation (or, more generally, co-occurrence) matrix and the underlying asymmetrical citation (occurrence) matrix.
In the Web environment, the retrieval of original citation data is often not feasible. The author argues that, in that case, the Jaccard index should be used, preferably after adding the total number of citations (i.e., occurrences) on the main diagonal. The approach is illustrated with an analysis of a highly structured data set that contains exclusively positive correlations within each of two groups and negative correlations between them. Unlike Salton's cosine and the Pearson correlation, the Jaccard index abstracts from the shape of the distributions and focuses only on the intersection and the sum of the two sets. Since the correlations in the co-occurrence matrix may be partially spurious, this property of the Jaccard index can be considered an advantage in this case.

Klein examines the problem of processing queries with metrical constraints in XML-based information retrieval systems. The main question is how to define the distance between terms in different locations of the XML tree in an intuitively justifiable manner, without jeopardizing the ability to obtain reasonable retrieval results in terms of recall and precision. A new definition is proposed, and its usefulness is demonstrated on several examples from the INEX collection.

Zhou and Chaovalit propose an ontology-supported polarity mining (OSPM) approach, designed to enhance polarity mining by providing detailed topic-specific information through an ontology. OSPM was evaluated in the movie-review domain using both supervised and unsupervised techniques. The results indicate that OSPM outperformed the baseline method without ontology support.

Li and Wu present a new approach, People Search, to searching the Web for people who share similar interests. Their design for the matching process incorporates a “person representation” that allows the same representation to be used when composing a query.
The proposed algorithm integrates the textual content and hyperlink information of all pages belonging to a personal Web site to represent a person and to match persons. It is compared to algorithms that incorporate only content or only link information, using precision, recall, F, and Kruskal-Goodman Γ measures to evaluate its effectiveness.

Kuo, Li, and Yang present an adaptive learning framework for Phonetic Similarity Modeling (PSM) that supports the automatic construction of transliteration lexicons. The learning algorithm starts with minimal prior knowledge about transliteration matching and acquires knowledge iteratively from the Web. The authors examine unsupervised learning and active learning strategies that minimize human supervision in terms of data labeling. The learning process refines the PSM and constructs a transliteration lexicon at the same time. The proposed PSM and its learning algorithm are evaluated through a series of systematic experiments, which show that the framework is reliably effective on two independent databases.

Bollen and Van de Sompel examine the effects of community characteristics on usage-based assessments of scholarly impact. The authors define a journal Usage Impact Factor that mimics the definition of the Thomson Scientific ISI Impact Factor. Usage Impact Factor rankings were calculated on the basis of a large-scale usage data set recorded by the linking servers of the California State University system from 2003 to 2005, and the resulting journal rankings were then compared to the ISI Impact Factor. The results indicate that the particular scientific and demographic characteristics of a discipline have a strong effect on the resulting usage-based assessments of scholarly impact.
In particular, the data indicate that, as the number of graduate students and faculty in a discipline increases, Usage Impact Factor rankings converge more strongly with the ISI Impact Factor.

González-Alcaide et al. describe, in this brief communication, a thematic analysis of the descriptors included in the bibliographic records indexed in the Library and Information Science Abstracts (LISA) database during the 2004–2005 period. The results indicate that, although a good deal of research continues to be of a practical and applied nature, the development of the World Wide Web and of information and communication technologies has brought studies of these spheres into research activity. The importance assumed by the end user is also a prominent research topic.

Arencibia-Jorge et al. illustrate the application of successive h-indices in the evaluation of a scientific institution, using the researcher-department-institution hierarchy as levels of aggregation. The scientific production of the Cuban National Scientific Research Center from 2001 to 2005, as indexed in the Web of Science, was examined. The Hirsch index (h-index) was employed to calculate the individual performance of the staff, with the g-index created by Egghe and the A-index created by Bi-Hui as complementary indicators. The successive h-indices proposed by Schubert were used to determine the scientific performance of each department as well as the overall performance of the institution. The possible advantages of the method for institutional performance assessment are discussed.
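As an illustration of the successive h-index hierarchy just described, the sketch below computes researcher, department, and institution h-indices from per-paper citation counts, with Egghe's g-index as a complementary indicator. This is a minimal sketch of the general technique, not the authors' code: the function names (`h_index`, `g_index`) and the `departments` data are invented for the example.

```python
def h_index(counts):
    """Largest h such that at least h items each have a value >= h."""
    h = 0
    for rank, value in enumerate(sorted(counts, reverse=True), start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

def g_index(counts):
    """Largest g such that the top g papers together have >= g^2 citations."""
    total, g = 0, 0
    for rank, value in enumerate(sorted(counts, reverse=True), start=1):
        total += value
        if total >= rank * rank:
            g = rank
    return g

# Invented citation counts: each inner list is one researcher's papers.
departments = {
    "dept_a": [[10, 8, 5, 4, 3], [6, 6, 6, 1], [2, 2, 1]],
    "dept_b": [[12, 9, 2], [7, 5, 5, 4, 1]],
}

# Successive application: researcher h-indices feed the department
# h-index, and department h-indices feed the institution h-index.
researcher_h = {d: [h_index(p) for p in pubs] for d, pubs in departments.items()}
dept_h = {d: h_index(hs) for d, hs in researcher_h.items()}
institution_h = h_index(list(dept_h.values()))
```

Each level reuses the same definition on the level below it, which is what makes the indicator "successive": a department is strong when many of its researchers have high individual h-indices, and likewise for the institution over its departments.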