EMERGING TECHNOLOGIES Multilingual Computing

Peer-reviewed article

2002; University of Hawaiʻi at Mānoa; Volume: 6; Issue: 2; Language: English

ISSN

1094-3501

Authors

Robert Godwin‐Jones

Topic(s)

Information Retrieval and Data Mining

Abstract

Language teachers, unless they teach ESL, often bemoan the use of English as the lingua franca of the Internet. The Web can be an invaluable source of authentic language use, but many Web sites worldwide bypass native tongues in favor of what has become the universal Internet language. The issue is not just one of audience but also of the capability of computers to display and input text in a variety of languages. It is a software and hardware problem, but also one of worldwide standards. In this column we will look at current developments in multilingual computing, in particular the rise of Unicode and the arrival of alternative computing devices for world languages such as India's Simputer.

Character Sets

A computer registers and records characters as a set of numbers in binary form. Historically, character data is stored in 8-bit chunks (a bit is either a 1 or a 0) known as bytes. Personal computers, as they evolved in the United States for English-language speakers, used a 7-bit character code known as ASCII (American Standard Code for Information Interchange), with one bit reserved for error checking. The 7-bit ASCII encoding encompasses 128 characters: the Latin alphabet (lower and upper case), numbers, punctuation, and some symbols. This was used as the basis for larger 8-bit character sets with 256 characters (sometimes also referred to as ASCII) that include accented characters for West European languages.

ASCII has been around since 1963 and was extended by the ISO (International Organization for Standardization) in 1967 to allow for use of character codes for non-Latin alphabet languages such as Arabic and Greek. Later, to satisfy the need for use of languages such as Russian and Hebrew, the standard called ISO 2022 was established, later expanded into ISO 8859-1 (often called Latin-1), which is widely used today for the interchange of information across the Web in Western languages.
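The relationship between 7-bit ASCII and the 8-bit Latin-1 extension described above can be illustrated with a minimal Python sketch. The helper name `fits_in_7_bits` and the sample words are assumptions chosen for illustration; they do not come from the column.

```python
# Sketch: 7-bit ASCII versus the 8-bit ISO 8859-1 (Latin-1) extension.
# The helper name and sample words are illustrative assumptions.

def fits_in_7_bits(text: str) -> bool:
    """Return True if every character code is below 128 (the ASCII range)."""
    return all(ord(ch) < 128 for ch in text)

plain = "cafe"
accented = "café"

# ASCII stores each character as one number in the range 0-127.
ascii_bytes = plain.encode("ascii")
print(list(ascii_bytes))            # [99, 97, 102, 101]
print(format(ord("A"), "07b"))      # 1000001 (seven binary digits)

# The accented character has no ASCII code, so encoding fails.
try:
    accented.encode("ascii")
except UnicodeEncodeError:
    print("'é' is outside the 128-character ASCII range")

# Latin-1 uses the eighth bit, giving 256 codes; 'é' is code 233.
latin1_bytes = accented.encode("latin-1")
print(list(latin1_bytes))           # [99, 97, 102, 233]
```

The first 128 Latin-1 codes coincide with ASCII, which is why the unaccented letters produce the same numbers in both encodings.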
Actually, Latin-1 is one of 10 character sets, all 8-bit, defined by ISO 8859; others target eastern European languages, Turkish, Hebrew, Greek, Icelandic, and Celtic. The variety of ISO 8859 encodings is evident in the multiple character encodings which can be set in contemporary Web browsers.

ASCII does a fine job for working in English, since that was what it was designed to do. Likewise, 8-bit ISO 8859 character sets are adequate for displaying most of the world's writing systems. But they are not capable of dealing with languages with many more characters, such as Japanese or Chinese. What is needed for such languages is at minimum a 16-bit or two-byte system which can handle thousands of characters. Sixteen-bit encoding was not used initially on personal computers not just because of monolingual shortsightedness but also for technical reasons: early computers had very little memory and storage capacity.

With the current capacity of personal computers one might ask why not simply adopt a 3-byte or even a 4-byte character set system, which would supply a virtually limitless number of characters to be displayed, thus guaranteeing the encoding of any of the world's languages. The problem is that such encoding systems would tend to use many more resources than are necessary for the display of most linguistic data, thereby slowing down network transmission and making display problematic on smaller devices with less processing power and memory. Also, computer operating and networking systems were designed to handle 8-bit data chunks; keeping 8-bit systems helps transactions progress smoothly and avoids the necessity for universal system upgrades.

Unicode

The problem remains of how to encode all the ideographs for Asian languages as well as the alphabets for other writing systems. The solution which recently emerged is known as Unicode, which, for all practical purposes, is identical to the official ISO 10646 standard.
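The size trade-off between 8-bit, two-byte, and 4-byte encodings discussed above can be made concrete with a short Python sketch. The choice of `utf-16-be` and `utf-32-be` as stand-ins for a two-byte and a 4-byte system is an assumption made here for illustration, as are the helper name `encoded_size` and the sample strings; none of them appear in the column.

```python
# Sketch: storage cost of the same text under 1-, 2-, and 4-byte encodings.
# Assumption: utf-16-be and utf-32-be stand in for the "two-byte" and
# "4-byte" systems mentioned in the text; both are written without a
# byte-order mark so that only character data is counted.

def encoded_size(text, encoding):
    """Return the number of bytes needed, or None if the text cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

samples = {"English": "language", "Japanese": "日本語"}

for label, text in samples.items():
    for encoding in ("latin-1", "utf-16-be", "utf-32-be"):
        size = encoded_size(text, encoding)
        shown = "cannot be encoded" if size is None else f"{size} bytes"
        print(f"{label:8} {encoding:10} {shown}")
```

Under these assumptions the eight-letter English word grows from 8 bytes to 16 and then 32, while the three Japanese characters cannot be represented in Latin-1 at all and take 6 and 12 bytes in the wider encodings. That is the tension the text describes: wider codes cover more languages but cost more for text that an 8-bit set already handles.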
Unicode is not the ideal solution some advocates initially envisioned, a limitless (32-bit) system in which the emphasis is on inclusiveness rather than efficiency. …
