Article Open access

An Introduction to the Extensible Markup Language (XML)

1998; Association for Information Science and Technology; Volume: 25; Issue: 1; Language: English

10.1002/bult.104

ISSN

2163-4289

Authors

Martin Bryan

Topic(s)

Mathematics, Computing, and Information Processing

Abstract

© The SGML Centre, 1997

The Extensible Markup Language (XML), as specified in the World Wide Web Consortium's (W3C) Recommendation approved on February 10, 1998, is a subset of the Standard Generalized Markup Language (SGML) defined in ISO standard 8879:1986 that is designed to make it easy to interchange structured documents over the Internet. XML files always clearly mark where the start and end of each of the logical parts (called elements) of an interchanged document occur. XML restricts the use of SGML constructs to ensure that fallback options are available when access to certain components of the document is not currently possible over the Internet. It also defines how Internet Uniform Resource Locators can be used to identify component parts of XML data streams.

By defining the role of each element of text in a formal model, known as a Document Type Definition (DTD), users of XML can check that each component of a document occurs in a valid place within the interchanged data stream. An XML DTD allows computers to check, for example, that users do not accidentally enter a third-level heading without first having entered a second-level heading, something that cannot be checked using the Hypertext Markup Language (HTML) previously used to code documents that form part of the World Wide Web (WWW) of documents accessible through the Internet.

However, unlike SGML, XML does not require the presence of a DTD. If no DTD is available, either because all or part of it is not accessible over the Internet or because the user failed to create it, an XML system can assign a default definition for undeclared components of the markup.

XML allows users to

• bring multiple files together to form compound documents
• identify where illustrations are to be incorporated into text files, and the format used to encode each illustration
• provide processing control information to supporting programs, such as document validators and browsers
• add editorial comments to a file.
It is important to note, however, that XML is not

• a predefined set of tags, of the type defined for HTML, that can be used to mark up documents
• a standardized template for producing particular types of documents.

XML was not designed to be a standardized way of coding text; in fact, it is impossible to devise a single coding scheme that would suit all languages and all applications. Instead XML is a formal language that can be used to pass information about the component parts of a document to another computer system. XML is flexible enough to be able to describe any logical text structure, whether it be a form, memo, letter, report, book, encyclopedia, dictionary or database.

XML is based on the concept of documents composed of a series of entities. (People familiar with modern programming techniques will probably be more comfortable using the word object, which is synonymous in this case.) Each entity can contain one or more logical elements. Each of these elements can have certain attributes (properties) that describe the way in which it is to be processed. XML provides a formal syntax for describing the relationships between the entities, elements and attributes that make up an XML document, which can be used to tell the computer how it can recognize the component parts of each document.

XML differs from other markup languages in that it does not simply indicate where a change of appearance occurs or where a new element starts. XML sets out to clearly identify the boundaries of every part of a document, whether it be a new chapter, a piece of boilerplate text or a reference to another publication. To allow the computer to check the structure of a document users must provide it with a document type definition that declares each of the permitted entities, elements and attributes and the relationships among them.
To use a set of markup tags that has been defined by a trade association or similar body, users need to know how the markup tags are delimited from normal text and in which order the various elements should be used. Systems that understand XML can provide users with lists of the elements that are valid at each point in the document and will automatically add the required delimiters to the name to produce a markup tag. Where the data capture system does not understand XML, users can enter the XML tags manually for later validation. Elements and their attributes are entered between matched pairs of angle brackets (<...>) while entity references start with an ampersand and end with a semicolon (&...;).

Because XML tag sets are based on the logical structure of the document, they are somewhat easier to understand and remember than physically based markup schemes of the type typically provided by word processors. An XML memo might be coded as follows:

All staff
Martin Bryan
5th November
Cats and Dogs
Please remember to keep all cats and dogs indoors tonight.

This form is ideal for a computer to follow and, therefore, to process. The start and end of each logical element of the file have been clearly identified by entry of a start-tag and an end-tag. Notice that at this point nothing has been said about the format of the final document. From the neutral format provided by XML, users can choose to display the memo on a screen, whose size can be varied to suit user preferences, to print the text onto a pre-printed form or to generate a completely new form, positioning each element of the document where needed.

To define tag sets users must create a Document Type Definition that formally identifies the relationships among the various elements that form their documents.
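The memo above is shown without its tags. As a minimal sketch of how it might look once tagged, and of how a program can recover each logical element, the following uses Python's standard `xml.etree.ElementTree`. The element names `to`, `from`, `date` and `subject` are assumptions made for illustration (only `memo` and `para` are named in the text), and `memo_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Element names other than memo and para are assumed for illustration;
# only the text content is taken from the article.
MEMO = """\
<memo>
  <to>All staff</to>
  <from>Martin Bryan</from>
  <date>5th November</date>
  <subject>Cats and Dogs</subject>
  <para>Please remember to keep all cats and dogs indoors tonight.</para>
</memo>"""


def memo_fields(xml_text):
    """Return (element name, text) pairs for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


if __name__ == "__main__":
    # Nothing here says how the memo should look; a program decides that later.
    for name, text in memo_fields(MEMO):
        print(f"{name}: {text}")
```

Because every element is bracketed by a start-tag and an end-tag, the program never has to guess where one field stops and the next begins.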
For a simple memo the XML DTD might take the following form:

<!DOCTYPE memo [ ]>

This model tells the computer that a memo consists of a sequence of header elements, the last of which is optional, followed by the content of the memo. The content of the memo defined in this simple example is a number of paragraphs, at least one of which must be present (this is indicated by the + immediately after para). In this simplified example a paragraph has been defined as a leaf node that can contain parsed character data (#PCDATA), i.e., data that has been checked to ensure that it contains no unrecognized markup strings. In a similar way the header elements have been declared to be leaf nodes in the document structure tree.

Where the position of an element in the model is variable, the element can be defined as part of a repeatable choice of elements. For example, to allow references to books or figures to occur anywhere in the text of a paragraph, but not in the heading, the model definition for the para element could be modified to offer a repeatable choice between character data and the reference elements, each of which is then declared in its own right.

Some elements do not require any contents as such. They are simply placeholders that indicate where a certain process is to take place. A special form of tag is used in XML to indicate empty elements that do not have any contents, and therefore have no end-tag. For example, the graphical part of a figure is typically an empty element that acts as a placeholder, while an optional companion element identifies any text associated with the illustration. Together the two make up a figure, which would typically be placed at the same level as a text paragraph. Element declarations for these can be used to extend the model for a memo to allow it to include figures as well as text.

Where elements can have variable forms, or need to be linked together, they can be given suitable attributes to specify the properties to be applied to them.
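To make the memo content model described above concrete, here is a rough sketch that writes it out as element declarations and then checks a document against the same sequence rule. The declarations are a hypothetical reconstruction: `to`, `from`, `date` and `subject` are assumed names. Python's standard parser reads an internal DTD subset but does not validate against it, so the check is done by hand with a regular expression that mirrors the model; it is an illustration of the idea, not a DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical declarations: header element names are assumed, para is from the text.
MEMO_DTD = """<!DOCTYPE memo [
<!ELEMENT memo (to, from, date, subject?, para+)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT para (#PCDATA)>
]>
"""

# The memo content model rewritten as a regular expression over child names:
# a fixed header sequence, an optional subject, then one or more paragraphs.
MEMO_MODEL = re.compile(r"to,from,date,(subject,)?(para,)+")


def matches_memo_model(xml_text):
    """Check the root's children against the assumed memo content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "memo":
        return False
    # Every child is declared as a leaf node, so none may contain elements.
    if any(len(child) for child in root):
        return False
    names = "".join(child.tag + "," for child in root)
    return MEMO_MODEL.fullmatch(names) is not None


GOOD = MEMO_DTD + "<memo><to>a</to><from>b</from><date>c</date><para>d</para></memo>"
NO_PARA = MEMO_DTD + "<memo><to>a</to><from>b</from><date>c</date></memo>"

if __name__ == "__main__":
    print(matches_memo_model(GOOD), matches_memo_model(NO_PARA))
```

The `?` and `+` occurrence indicators carry the same meaning in the regular expression as in the content model, which is why the translation is so direct.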
For example, it might be decided that a field of a memo could optionally be printed in bold or italics. A suitable attribute list declaration would, in this case, enumerate the permitted font values and name a default. This tells the computer that the start-tag can be amended to request bold or italic type if a variant font is required. If no such change is requested the program is to use the default value.

One especially important type of attribute is the unique identifier. Because it is unique it can be used to provide a cross reference between two points in the document. For example, you can ensure that a unique identifier is assigned to each figure by adding to the DTD an attribute list declaration that makes an identifier attribute compulsory. This tells the computer that every figure element must be entered with a unique identifier within the start-tag.

Unique identifiers can be referred to within the text by use of attributes that form identifier references. Typically the attribute list declaration for a figure reference element would declare such an attribute with the keyword #IMPLIED, which indicates that it is permissible to omit the attribute in some instances of the element. For example, this might need to be done if the reference were to a figure in another publication. (Unique identifiers only apply to the current XML document instance; they are not necessarily unique across document sets.)

XML also contains techniques for adding standard (boilerplate) text to a file and for handling characters that are outside the standard character set, but which are available on certain output devices. Commonly used text can be declared within the DTD as a text entity, which pairs a short mnemonic name such as company with the text it stands for. Once such a declaration has been made in the DTD users can use an entity reference of the form &company; in place of the full name of the company.
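The attribute mechanisms described above (a defaulted font attribute, a compulsory unique identifier, and an optional identifier reference) can be sketched as follows. Every element and attribute name here (`subject`, `font`, `figure`, `id`, `figref`, `refid`) is hypothetical. Python's standard parser applies defaults declared in the internal subset but does not enforce ID uniqueness or resolve references, so `check_ids` is an assumed helper that does that part by hand.

```python
import xml.etree.ElementTree as ET

# All names in these declarations are assumed for illustration.
DOC = """<!DOCTYPE memo [
<!ATTLIST subject font (normal|bold|italic) "normal">
<!ATTLIST figure id ID #REQUIRED>
<!ATTLIST figref refid IDREF #IMPLIED>
]>
<memo>
<subject>Cats and Dogs</subject>
<figure id="fig1"/>
<para>See <figref refid="fig1"/> and <figref/>.</para>
</memo>"""


def check_ids(root, id_attr="id", ref_attr="refid"):
    """Return a list of problems: duplicate identifiers and dangling references."""
    seen = set()
    problems = []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate id {value!r}")
            seen.add(value)
    for element in root.iter():
        ref = element.get(ref_attr)
        # An omitted (#IMPLIED) reference is allowed and simply skipped.
        if ref is not None and ref not in seen:
            problems.append(f"dangling reference {ref!r}")
    return problems


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    # The start-tag said nothing about font, so the declared default is supplied.
    print(root.find("subject").get("font"))
    print(check_ids(root))
```

The second `figref` carries no `refid`, which is the situation #IMPLIED is meant to permit.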
An advantage of using this technique is that, should the name of the company referred to by the mnemonic change later, only the entry in the DTD needs to be changed, as the entity reference will automatically call in the current definition.

Text stored in another file can also be incorporated into a file using entity references. In this case the entity declaration in the DTD identifies the location of the file containing the text to be referenced, and the entity reference (&appendix;) shows where the file is to be added to the main text stream.

Where non-standard characters are required special system-dependent entities can be declared to show how the characters can be generated. With a suitable entry in place, when the string &eacute; is encountered in the text the computer will replace it by the code whose decimal value is 233. Alternatively the decimal character number, or its hexadecimal equivalent preceded by x, can be used directly as part of a character reference, e.g., &#233; or &#xE9; to generate é.

XML provides a number of techniques for handling non-standard document elements. Where the coding scheme of an element of the file, such as an illustration, differs from that used for normal text the contents of the element can be treated as an entity with a special notation. Alternatively details of the relevant notation can be defined as an attribute of an element. To identify where the figure is to be positioned in the text you would enter either an entity reference such as &fig1; or an empty element. In both these situations a notation declaration is required to tell the program what to do with the unparsed data that is contained in the referenced file.
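The text entity and character reference mechanisms described above can be tried out with Python's standard `xml.etree.ElementTree`, which expands entities declared in the internal subset. This is only a sketch: the company name is invented for the example, the `eacute` declaration is an assumed form of the "typical entry", and `expand` is a hypothetical helper. External entities such as the appendix file are not fetched by this parser and are left out.

```python
import xml.etree.ElementTree as ET

# The replacement text for company is invented; eacute maps to code 233.
DTD = """<!DOCTYPE para [
<!ENTITY company "Example Widgets Ltd">
<!ENTITY eacute "&#233;">
]>
"""


def expand(body):
    """Parse a para element after the DTD and return its expanded text."""
    return ET.fromstring(DTD + "<para>" + body + "</para>").text


if __name__ == "__main__":
    # The mnemonic is replaced by whatever the DTD currently declares.
    print(expand("Welcome to &company;."))
    # Named entity, decimal reference and hexadecimal reference all give é.
    print(expand("caf&eacute; caf&#233; caf&#xE9;"))
```

Changing the single `company` declaration changes every place the reference is used, which is the maintenance advantage the text describes.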
Typically this takes the form of a call to a program module.

Where text, such as computer code, has been created in a form designed to be output line by line exactly as in the original, it can be flagged as a special type of parsed character data by addition of a special reserved attribute, xml:space, to the element declaration. The value preserve means preserve the line breaks rather than use the default of replacing line breaks by spaces before justifying the contents of the element.

An XML file normally consists of three types of markup, the first two of which are optional:

• An XML processing instruction identifying the version of XML being used, the way in which it is encoded and whether it references other files or not, e.g., <?xml version="1.0" encoding="UCS2" standalone="yes"?>
• A document type declaration that either contains the formal markup declarations in its internal subset (between square brackets) or references a file containing the relevant markup declarations (the external subset)
• A fully-tagged document instance which consists of a root element, whose element type name must match that assigned as the document type name in the document type declaration, within which all other markup is nested.

If all three components are present, and the document instance conforms to the rules defined in the document type declaration, the document is said to be valid. If only the last component is present, and no formal model is present, all the XML processor can do is to check that the document instance is well-formed, i.e., that each element is properly nested within its parent elements and that each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string.

XML-coded files are, by their nature, ideal for storing in databases.
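The well-formedness check described above, which needs no DTD, is what a non-validating parser performs. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; `is_well_formed` is a hypothetical helper, and the sample element and attribute names are assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


SAMPLES = {
    "nested and quoted": '<memo><para font="bold">text</para></memo>',
    "with declaration": '<?xml version="1.0" standalone="yes"?>\n<memo><para>ok</para></memo>',
    "overlapping tags": "<memo><para>text</memo></para>",
    "unquoted attribute": "<memo><para font=bold>text</para></memo>",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, is_well_formed(text))
```

Passing this check says nothing about validity: a document can be well-formed and still break every rule of the model it was meant to follow.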
Because XML files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file. A standardized interface to XML data is defined through W3C's Document Object Model (DOM), which provides a CORBA IDL interface between applications exchanging XML data. Data stored using non-XML notations will need appropriate application software to process it, but the XML-coded file will correctly identify where each piece of such data belongs in the completed document and where it has been stored prior to use.

By storing data in the clearly defined format provided by XML you can ensure that your data will be transferable to a wide range of hardware and software environments. New techniques in programming and processing data will not affect the logical structure of your document's message. If more detail needs to be added to the file, all you need to do is update the model and then add new markup tags where required in the document instance. If a completely new style is required then the existing document model can be linked to the new one to provide automatic updating of document structures.

For more information about the European Commission's Open Information Interchange (OII) initiative see http://www.echo.lu/oii/en/oiistand.html

For more information about the European XML/EDI Pilot Project see http://www.cenorm.be/isss/workshop/ec/xmledi/isss-xml.html
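As an illustration of the Document Object Model mentioned above, the sketch below uses Python's standard `xml.dom.minidom`, one implementation of the DOM interfaces, to read a memo and then add more detail to it. The sample document, its element names other than `memo` and `para`, and the helper functions are assumptions made for the example.

```python
from xml.dom import minidom

# A cut-down memo; the to element name is assumed.
SOURCE = "<memo><to>All staff</to><para>One</para><para>Two</para></memo>"


def paragraph_texts(doc):
    """Collect the text of every para element through the DOM interface."""
    return [p.firstChild.data for p in doc.getElementsByTagName("para")]


def add_paragraph(doc, text):
    """Append a new para element to the root, as when detail is added later."""
    para = doc.createElement("para")
    para.appendChild(doc.createTextNode(text))
    doc.documentElement.appendChild(para)


if __name__ == "__main__":
    doc = minidom.parseString(SOURCE)
    print(paragraph_texts(doc))
    add_paragraph(doc, "Three")
    # Serialising the tree again gives XML that any other system can read.
    print(doc.documentElement.toxml())
```

An application written against these interfaces works on the logical structure of the document and does not depend on how the file was stored or transmitted.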
