A page‐shift transformation format of ISO 10646
2001; Wiley; Volume: 32; Issue: 1; Language: English
10.1002/spe.427
ISSN 1097-024X
Abstract
The ISO 10646 Universal Character Set (UCS), or Unicode, covers the symbols of most of the world's written languages. There are various UCS transformation formats (UTFs). UTF‐8 is compatible with systems that assume 8‐bit characters, but it is not space efficient for some scripts. For files consisting mostly of Asian characters such as Han ideographs, file sizes increase by about 50% under UTF‐8, because each ideograph takes three bytes instead of two. The Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale‐specific character set, but it is complicated and is not intended to serve as a general‐purpose interchange format. This paper proposes a page‐shift transformation format of ISO 10646, called UTF‐S. UTF‐S has four pages: 1‐byte, 2‐byte, 3‐byte, and 4‐byte. A shift to page 0 uses the special code $00_{16}$; shifts to pages 1, 2, and 3 use the ISO 2022 shift codes SO, SS2, and SS3, respectively. We test several text files and compare these UTFs with Big5, a locale‐specific character set. The results show that the space efficiency of UTF‐S is better than that of UTF‐16 and UTF‐8 and is close to that of SCSU. UTF‐S is suitable for replacing locale‐specific character sets with ISO 10646 in Internet applications such as the World Wide Web. Copyright © 2001 John Wiley & Sons, Ltd.
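As a rough illustration of the size overhead described in the abstract, the following Python sketch compares the encoded length of a short Han string in Big5, UTF‐16, and UTF‐8 using Python's built-in codecs. The sample string and the helper name `encoded_sizes` are illustrative assumptions, not taken from the paper, and the sketch does not implement UTF‐S or SCSU.

```python
# Four common Han ideographs (U+4E2D, U+6587, U+5B57, U+5143), written as
# escapes so the example does not depend on the source file's encoding.
# Assumption: this sample is only illustrative; it is not from the paper.
SAMPLE = "\u4e2d\u6587\u5b57\u5143"


def encoded_sizes(text):
    """Return the byte length of `text` in each encoding (no byte order mark)."""
    return {name: len(text.encode(name)) for name in ("big5", "utf-16-be", "utf-8")}


sizes = encoded_sizes(SAMPLE)

# Relative growth of UTF-8 over the 2-byte locale-specific encoding.
growth = (sizes["utf-8"] - sizes["big5"]) / sizes["big5"]

print(sizes)
print(f"UTF-8 growth over Big5: {growth:.0%}")
```

For text made up entirely of such ideographs, Big5 and UTF‐16 use two bytes per character and UTF‐8 uses three, which is where the figure of about 50% comes from. Real files that mix in ASCII characters will show a smaller difference.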