Article Open access

Web services for controlled vocabularies

2007; Association for Information Science and Technology; Volume: 32; Issue: 5; Language: English

10.1002/bult.2006.1720320505

ISSN

2163-4289

Authors

Diane Vizine‐Goetz, Andrew R Houghton, Eric Childress

Topic(s)

Digital Humanities and Scholarship

Abstract

Amid the debates about whether folksonomies will supplant controlled vocabularies and whether the Library of Congress Subject Headings (LCSH) and Dewey Decimal Classification (DDC) system have outlived their usefulness, libraries, museums and other organizations continue to require efficient, effective access to controlled vocabularies for creating consistent metadata for their collections. In this article, we present an approach for using Web services to interact with controlled vocabularies. Services are implemented within a service-oriented architecture (SOA) framework. SOA is an approach to distributed computing in which services are loosely coupled and discoverable on the network.

A set of experimental services for controlled vocabularies is provided through the Microsoft (MS) Office Research task pane, a small window or sidebar that opens next to Internet Explorer (IE) and Microsoft Office applications. The research task pane is a built-in feature of IE when MS Office 2003 is installed. It enables a user to take advantage of a number of research and reference services accessible over the Internet. Other Web browsers, such as Mozilla Firefox and Opera, also provide sidebars that could be used to deliver similar, loosely coupled Web services.

Vocabularies included in the project:
DCMI Type Vocabulary
Guidelines on Subject Access to Individual Works of Fiction, Drama, etc. (GSAFD) list of form/genre headings
Library of Congress Subject Headings (LCSH)
Library of Congress Annotated Card Program AC Subject Headings (LCSHac)
Medical Subject Headings (MeSH) 2005
Medical Subject Headings (MeSH) 2005 Sample
Medical Subject Headings (MeSH) 2006
Newspaper Genre List (NGL)
Radio Form/Genre Terms Guide (RADFG)
Répertoire de vedettes-matière (RVM) (access restricted)
Union List of Artist Names (ULAN) Sample

For the project, all of the controlled vocabularies were encoded in the MARC 21 Format for Authority Data in XML.
The MARC 21 Authority Format was chosen because it enabled us to code common controlled vocabulary elements, such as preferred and non-preferred terms, term relationships, term mappings, the source of the content and the origin of changes. For some vocabularies it was first necessary to convert the controlled vocabulary data from word processing documents or HTML pages to more structured data formats and then into MARC 21. A sample term from the DCMI Type Vocabulary, originally available only as HTML and Resource Description Framework (RDF), is shown in MARC 21 in XML in Figure 1.

Figure 1. Encoding of DCMI Type value ‘Image’ in MARC 21 Format for Authority in XML

Figure 2. Zthes 0.5 encoding of DCMI Type value ‘Image’

The DCMI Type Vocabulary is a controlled list of terms that can be used as values for the DCMI Resource Type element to identify the genre of a resource. Data field tag “040” subfield code “a” contains the MARC organization code for DCMI, the originator of the content; subfield code “c” contains the code for OCLC Research, the party responsible for converting the content to the MARC format. The genre term Image is coded in tag “155” and the associated genre term Still Image is coded in tag “555.”

For vocabularies already available in MARC 21, the conversion to MARCXML was a relatively straightforward process. Some problems were encountered with XML and XSLT tools when processing the larger vocabularies (more than 100,000 records), especially after the files were enhanced with the vocabulary's full reference structure, term mappings and links to external Web sites. Once coded as XML, the data could be used as the basis for Web services.

SKOS (Simple Knowledge Organization System) core, an emerging RDF schema for thesauri and related knowledge organization schemes, and the Zthes 0.5 schema, a Z39.50 profile for thesaurus navigation, are also suitable formats for encoding vocabulary resources for Web services.
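
To make the MARC 21 field layout described above concrete, the following is a minimal Python sketch, not the project's actual tooling, that builds a MARCXML authority record for the DCMI Type term Image. It includes only the 040, 155 and 555 fields named in the text; the leader and control fields of a complete record are omitted. The organization codes passed in are placeholders, because the text does not give the actual values, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML ("MARC21 slim") schema.
MARC_NS = "http://www.loc.gov/MARC21/slim"


def add_datafield(record, tag, subfields, ind1=" ", ind2=" "):
    """Append a MARC data field holding the given (code, value) subfields."""
    field = ET.SubElement(
        record,
        f"{{{MARC_NS}}}datafield",
        {"tag": tag, "ind1": ind1, "ind2": ind2},
    )
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value
    return field


def build_image_record(originator_code, converter_code):
    """Build a partial authority record for the genre term 'Image'."""
    ET.register_namespace("marc", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record")
    # 040 $a: originator of the content; $c: party that converted it to MARC.
    add_datafield(record, "040", [("a", originator_code), ("c", converter_code)])
    # 155: the genre term itself.
    add_datafield(record, "155", [("a", "Image")])
    # 555: the associated genre term.
    add_datafield(record, "555", [("a", "Still Image")])
    return record


# Placeholder organization codes (assumptions, not the real MARC codes).
record = build_image_record("ORIGINATOR-CODE", "CONVERTER-CODE")
print(ET.tostring(record, encoding="unicode"))
```

Keeping the source and conversion agency in 040 alongside the term fields is what lets a single record carry both the vocabulary content and its provenance.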
Phase III of the High-Level Thesaurus (HILT) project is an example of a project that is using the SKOS core for encoding controlled vocabularies and classification data. MARC and Zthes formats may be added to HILT at a later stage. The Zthes 0.5 encoding for DCMI Type value Image is shown in Figure 2.

The implementation of Web services support in many widely adopted platforms presents opportunities to offer terminology Web services in various modular arrangements. OCLC Research is making a set of services for controlled vocabularies available through the Microsoft Office Research task pane. To use the OCLC Terminology Services (TS) pilot vocabularies, users add OCLC services to the research pane via a URL provided to pilot participants. Within the research pane, pilot users can search a given vocabulary, display information about a term, follow links to associated terms within a vocabulary and follow links to external Web sites. Because the pilot implementation is intended to be used alongside the user's cataloging or metadata editing application, multiple copy and paste operations are provided. Users can insert controlled vocabulary terms with MARC field tags, indicators and subfield codes into MARC catalog records or, for non-MARC applications, insert terms as strings into their records without MARC coding.

Figure 3. OCLC Terminology Services pilot vocabulary in research pane alongside Connexion session

Sections of a term record:
Class numbers
Non-preferred terms
Broader terms
Related terms
Narrower terms
Mapped terms
Notes

Links to external websites are displayed in the notes section of a record. The sample record in Figure 3 contains links to the MeSH online record and MeSH tree structure on the National Library of Medicine website.

The register Web method is called by the research pane client to obtain information about the information provider and the services that will be offered to the client.
The query Web method is called by the research pane client to obtain content from the information provider that will be displayed in the research pane content area. The Research Services Web service, item two (Figure 4, lower left), is used as a proxy to any backend storage technology containing controlled vocabularies, item three (upper left). For example, vocabularies can be stored as full text databases, SQL databases or XML files. The diagram depicts access to the various backend storage technologies through distributed Web service protocol technologies such as the SRU/W protocol (Search/Retrieve via URL/Web service), REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). For maximum flexibility, we chose to insulate access to the various backend storage technologies. This approach allows a vocabulary to reside at OCLC or another location. The Web service, item two (lower left), can access those backend storage technologies directly and/or use the distributed Web service protocol technologies.

Figure 4. OCLC Terminology Services Pilot System Architecture

OCLC's experimental implementation uses the OCLC Pears full text database software along with a Search/Retrieve Web service (SRW) interface to access the vocabularies. The terminology Web service acts as a proxy to the vocabularies, providing query and markup translation along with authentication and authorization when necessary.

Our work with the Microsoft Office Research task pane explored the use of Web-based terminology services with library automation systems. We are now expanding our scope to Web-based terminology services that could interact with Semantic Web applications. The SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments) project is a Semantic Web initiative that seeks to enhance interoperability among digital assets, schemata, vocabularies, ontologies, metadata and services.
The initiative is a joint project of the W3C, MIT Libraries and MIT Computer Science and Artificial Intelligence Laboratory. The SIMILE project has created Piggy Bank, a Firefox Web browser extension that allows existing information on the Web to be used in more useful and flexible ways. OCLC Research is investigating how the OCLC Pears full text database software, along with its Search/Retrieve Web service (SRW) interface, could be modified to interact with SIMILE's Piggy Bank Semantic Web application. Our initial investigation into interoperability between these applications has been promising. Our focus has been on addressing issues on the provider side rather than the consumer side, that is, modifying the OCLC software and not the Piggy Bank application.

Interoperability issues have arisen due to differences between metadata formats and identifiers. For the Terminology Services project, all of the controlled vocabularies were encoded in the MARC 21 Format for Authority Data using the MARCXML standard. The SIMILE project uses a different standard, RDF-XML. The OCLC Pears full text database software contains an SRW service interface that is controlled by a series of Extensible Stylesheet Language (XSL) transforms. Our initial investigation revealed that, although it was possible to create an XSL transform to convert between the MARCXML and SKOS RDF-XML markup languages, the difference between the character encodings for these standards was more problematic. The resolution was to replace the existing XSLT processor (Apache Xalan XSLT 1.0) with the Saxon XSLT 2.0 processor and to create an XSLT 2.0 transform that converts between the MARCXML and SKOS RDF-XML markup languages.

Identifiers for the terms in controlled vocabularies are also a concern. Many controlled vocabularies either do not have identifiers (the preferred term acts as the identifier) or have internal identifiers that are not Web actionable URLs.
An example of an identifier for a term is shown in Figure 3. It is the unique ID, D001828, associated with the MeSH heading Body Image. Although the RDF-XML standard does not require Web actionable URLs, the lack of them makes Semantic Web applications like SIMILE's Piggy Bank less useful. The identifier issues remain under investigation and will affect the generation of the SKOS RDF-XML markup.
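
As an illustration of the MARCXML-to-SKOS conversion and the identifier problem discussed above, here is a minimal Python sketch. It is a stand-in for the project's XSLT 2.0 transform, not a reproduction of it. It covers the case where the preferred term acts as the identifier, and it mints URIs under a hypothetical base (http://example.org/vocab/). The field mapping (155 to skos:prefLabel, 455 to skos:altLabel, 555 to skos:related) and all function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

MARC = "http://www.loc.gov/MARC21/slim"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE_URI = "http://example.org/vocab/"  # hypothetical base for minted URIs


def subfield_a(record, tag):
    """Return every $a value found in data fields with the given tag."""
    path = f"{{{MARC}}}datafield[@tag='{tag}']/{{{MARC}}}subfield[@code='a']"
    return [el.text for el in record.findall(path)]


def term_uri(vocab, label):
    """Mint a URI from the term label, since no other identifier exists."""
    return BASE_URI + vocab + "/" + quote(label, safe="")


def marc_to_skos(record, vocab):
    """Convert one genre/form authority record to a SKOS concept in RDF."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("skos", SKOS)
    root = ET.Element(f"{{{RDF}}}RDF")
    pref = subfield_a(record, "155")[0]  # preferred term
    concept = ET.SubElement(
        root, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": term_uri(vocab, pref)}
    )
    ET.SubElement(concept, f"{{{SKOS}}}prefLabel").text = pref
    for alt in subfield_a(record, "455"):  # non-preferred terms
        ET.SubElement(concept, f"{{{SKOS}}}altLabel").text = alt
    for rel in subfield_a(record, "555"):  # associated terms
        ET.SubElement(
            concept, f"{{{SKOS}}}related", {f"{{{RDF}}}resource": term_uri(vocab, rel)}
        )
    return root


SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="155" ind1=" " ind2=" "><subfield code="a">Image</subfield></datafield>
  <datafield tag="555" ind1=" " ind2=" "><subfield code="a">Still Image</subfield></datafield>
</record>"""

skos_doc = marc_to_skos(ET.fromstring(SAMPLE), "dcmitype")
print(ET.tostring(skos_doc, encoding="unicode"))
```

Deriving the URI from the label keeps the sketch self-contained, but it also shows the weakness noted in the text: if a preferred term is renamed, every URI built from it changes. A vocabulary with stable internal IDs, such as the MeSH unique ID, would be a better basis for minting URIs.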
