Recently I helped a friend to more fully understand the technology involved in her Library Sciences graduate program for an end-of-the-semester paper on “Cataloging South Asian Materials – Problems inherent to cataloging South Asian materials in a digital environment”.

The problems relate to how the symbols for non-ASCII scripts (alphabets, or more precisely writing systems) are represented within computer systems, and to what extent software packages (in her case library cataloging systems) support Unicode or other encoding systems capable of representing the multitude of non-ASCII symbols required for accurately cataloging “foreign” material.

Using Wikipedia, the Library of Congress, and other sources, we realized there is no relatively plain-English explanation of character encoding written for technically minded people who cannot dissect formal technical specifications.

Small excerpts from her paper are paraphrased here. Used with permission. All rights reserved.

Data managed by computers must be represented in a way that can be stored by and retrieved by computers. It is helpful to imagine that all information handled by computers is represented as numeric values. Keystrokes on a keyboard produce signals that are translated by hardware into a sequence of numeric values which then can be saved or transformed. In order to share textual information between computer systems, there must be agreed-upon standards guiding which numeric values map to which character symbols (letters / numbers / punctuation / etc). In other words, computer systems must know how to encode data and subsequently decode it to display information to a user.
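For readers who would like to see this concretely, here is a minimal sketch in the Python programming language (any language would do; Python is used here purely for illustration). The built-in functions ord() and chr() expose the numeric value behind each symbol:

    # Each symbol corresponds to a numeric value inside the computer.
    for symbol in "Cat!":
        print(symbol, ord(symbol))      # ord() reveals the stored numeric value
    # prints, one pair per line: C 67 / a 97 / t 116 / ! 33

    print(chr(67))                      # chr() turns the value 67 back into the symbol "C"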

Everything represented by (saved in) a computer is a sequence of numeric values. To use a familiar example, a home computer user cannot save an actual photographic image inside a computer. Instead, the computer handles a digital representation of a photographic image, using a sequence of numeric values to represent the image data. Software applications that can display digital photographic images must know that a specific sequence of numeric values (an image file) should represent an image.

Additionally, such software must know how the image was encoded – whether the sequence of numeric values represents image data encoded as a JPEG, GIF, PNG, BMP, etc. Interpreting an encoded sequence of numeric values incorrectly will result in a useless decoding. Interpreting JPEG-encoded data as though it were GIF-encoded data will not reproduce anything even close to the originally saved image, even though both formats represent image data.

Similarly, text entered on a keyboard produces a sequence of numeric values that represents symbols from a writing system. In order for word processing software to save a document and subsequently retrieve it, the numeric values representing the symbols must be encoded and decoded using an agreed-upon standard. Just as JPEG and GIF are different formats for encoding and decoding image data, ASCII, MARC-8, and UTF-8 are examples of different formats employed by computer systems for encoding and decoding the sequences of numeric values representing the symbols of a writing system. [MARC-8 is an encoding system used exclusively by library cataloging software systems.]
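A small Python illustration of that idea (Python has no built-in MARC-8 codec, so Latin-1 and UTF-8 stand in here as two different encodings of the same word):

    # The same word becomes different byte sequences under different encodings.
    word = "café"
    print(word.encode("latin-1"))    # b'caf\xe9'        ("é" stored as the single value 233)
    print(word.encode("utf-8"))      # b'caf\xc3\xa9'    ("é" stored as two values, 195 and 169)

    # Decoding with the wrong standard garbles the text, just as reading JPEG data
    # as though it were GIF data would garble an image:
    print(word.encode("utf-8").decode("latin-1"))        # prints: cafÃ©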

Writing systems of the world (alphabets) have different technological demands regarding the representation of their symbols. The Latin alphabet with twenty-six letters, in both upper and lower case, needs fifty-two different numeric values just to represent the letters. Additional numeric values are also needed in order to handle punctuation and numerals.

When computing was done mostly in English and with the Latin alphabet, one of the most common methods for encoding keystrokes (“letters”) was ASCII (American Standard Code for Information Interchange). ASCII is a fixed 8-bit (one byte) character code that utilizes only seven of those bits for encoding symbols (2^7 = 128 possible numeric values). Those 128 numeric values are sufficient for encoding the symbols used to write English in the Latin alphabet, including numerals and punctuation, plus certain non-printing control “characters”.
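A short sketch of that seven-bit limit, again in Python, with format() showing each value as a full byte:

    # Every ASCII value fits in seven bits, so the eighth (high) bit of the byte is always 0.
    for symbol in ["A", "z", "9", "?"]:
        print(symbol, ord(symbol), format(ord(symbol), "08b"))
    # A 65 01000001
    # z 122 01111010
    # 9 57 00111001
    # ? 63 00111111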

ASCII cannot accommodate other complete writing systems since it does not have enough bits to represent the Latin alphabet plus the symbols used in different writing systems. For years, most personal computers were designed for use by Americans who used the Latin alphabet. There was no need to encode other writing systems. ASCII was sufficient.

Users in other countries, though, would create their own specialized character sets containing the particular letters or accents their writing systems required. These were only readable to another user whose computer happened to be set to display that same character set. This led to great communication difficulties in the early days of the Internet.
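The sketch below illustrates that difficulty with three real legacy character sets (Windows-1252, ISO-8859-5, and ISO-8859-7, chosen here only as examples): the very same stored byte means a different letter depending on which set the receiving computer assumes.

    # The single byte value 228 (hexadecimal E4) is a different symbol in each regional set.
    byte = bytes([0xE4])
    print(byte.decode("cp1252"))       # ä   (Western European)
    print(byte.decode("iso8859-5"))    # ф   (Cyrillic)
    print(byte.decode("iso8859-7"))    # δ   (Greek)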

Additional numeric values are needed for representing other writing systems consistently inside the computer. Devanagari, for instance, used for writing Hindi, has approximately 57 discrete sound elements or modifying marks (not including numerals and punctuation), each of which requires its own numeric value.

To address the need for representing non-Latin writing systems digitally, programmers began work in the 1980s on a 16-bit universal character encoding that became known as Unicode. While ASCII is able to encode only 128 values (characters), a 16-bit code can encode 65,536 (2^16) values in 16-bit (2 byte) chunks. (Unicode has since been extended well beyond that original 16-bit design, making room for more than a million characters.)

Unicode is an encoding standard that provides the basis for processing, storage, and interchange of text data in any language. It is an agreed-upon standard for computer systems to interpret the sequences of numeric values that represent text.
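In practical terms, every symbol receives one agreed-upon numeric value, called a code point, regardless of which writing system it belongs to. A brief Python illustration:

    # Unicode assigns one agreed-upon numeric value (a "code point") to every symbol.
    for symbol in ["A", "é", "क", "中"]:
        print(symbol, ord(symbol), hex(ord(symbol)))
    # A 65 0x41
    # é 233 0xe9
    # क 2325 0x915
    # 中 20013 0x4e2d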

Each Unicode character is identified by a single numeric value, but that value may be transmitted and stored using different formats. The Unicode Transformation Formats (UTF) provide 8-bit (one byte), 16-bit (two byte), and 32-bit (four byte) encoding units. A computer system must know whether and how to group the bytes of data it receives.
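For example, the single Devanagari letter क (code point U+0915) is stored as a different byte sequence under each transformation format; the Python sketch below uses the big-endian variants for clarity:

    # One code point (U+0915, Devanagari "ka") stored under each transformation format.
    ka = "\u0915"
    print(ka.encode("utf-8").hex())       # e0a495    (three one-byte units)
    print(ka.encode("utf-16-be").hex())   # 0915      (one two-byte unit)
    print(ka.encode("utf-32-be").hex())   # 00000915  (one four-byte unit)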

UTF-8 specifies that each byte of data (8 bits) comprises a single unit of information to the computer. Unlike ASCII, which is also an 8-bit format, UTF-8 allows up to four consecutive bytes (four single units of information) to be combined and interpreted as the encoding for a single symbol. This is where the power of UTF encoding comes in.

Though up to four bytes may be used to represent a particular symbol in UTF-8, not all of the 32 bits in those four bytes are available to encode symbol values. Some of the bits are used for recognizing and recombining the values spread across multiple bytes.
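The sketch below makes those bookkeeping bits visible: in a three-byte UTF-8 sequence, the first byte begins with 1110 (announcing a three-byte group) and each continuation byte begins with 10, leaving only the remaining bits to carry the symbol's value.

    # The leading bits of each byte mark how the bytes group together; only the
    # remaining bits carry the symbol's numeric value.
    for byte in "क".encode("utf-8"):      # U+0915 encodes to three bytes
        print(format(byte, "08b"))
    # 11100000   <- 1110 marks the start of a three-byte sequence
    # 10100100   <- 10 marks a continuation byte
    # 10010101   <- 10 marks a continuation byte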

UTF-8 does provide enough usable bits to encode over 1.1 million different numeric values – enough to represent every Unicode symbol. Symbols representing ASCII characters are always represented in one byte, while symbols in other writing systems may be represented in two, three, or four bytes of UTF-8 formatted data.
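In other words, the byte count per symbol depends on the writing system, as this small Python illustration shows (the musical symbol 𝄞 stands in for the rarer characters that need four bytes):

    # In UTF-8, the number of bytes per symbol depends on the writing system.
    for symbol in ["A", "é", "क", "中", "𝄞"]:
        print(symbol, len(symbol.encode("utf-8")), "byte(s)")
    # A 1, é 2, क 3, 中 3, 𝄞 4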

This variable size makes UTF-8 space efficient, an important consideration for storage and transmission of data, since additional bytes are used only when a symbol requires them for encoding its numeric value. Additionally, computer systems originally designed to handle ASCII can handle UTF-8 formatted documents (as long as those documents contain only ASCII characters) since ASCII values are represented exactly the same way in UTF-8.
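A quick Python check of that backward compatibility:

    # An ASCII-only document has exactly the same bytes whether encoded as ASCII or as UTF-8,
    # so software that expects plain ASCII can still read it.
    text = "Plain English text."
    print(text.encode("ascii") == text.encode("utf-8"))   # True
    print(text.encode("utf-8").decode("ascii"))           # Plain English text.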

UTF-16 is also a variable-size format. However, 16 bits (two bytes) of data are treated as a single unit of information to the computer. Like UTF-8, UTF-16 defines the way a single symbol's information may be split across more than one of these two-byte units.

Symbols are represented either in a single two-byte unit or in a consecutive pair of two-byte units (four bytes). UTF-8 and UTF-16 can represent exactly the same number of numeric values. However, ASCII values are represented differently in UTF-16, using two bytes instead of one. In addition to requiring twice as many bytes, a document containing only ASCII characters but formatted as UTF-16 is not compatible with computer systems originally designed to handle only ASCII, since those systems assume one-byte units, not two-byte units.
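A short Python illustration of both points, again using the big-endian variant for clarity (the musical symbol 𝄞 stands in for the rarer symbols that need a consecutive pair of units):

    # In UTF-16, even an ASCII letter occupies a full two-byte unit,
    # and rarer symbols occupy a consecutive pair of units (four bytes).
    print("A".encode("utf-16-be").hex())    # 0041       (one two-byte unit)
    print("क".encode("utf-16-be").hex())    # 0915       (one two-byte unit)
    print("𝄞".encode("utf-16-be").hex())    # d834dd1e   (a pair of two-byte units)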

UTF-32 is a fixed-size format of 32 bits (four bytes). Since four bytes of UTF-32-encoded text always comprise a single unit of information, none of the bits are needed for recombining multiple bytes into a single character value. UTF-32, therefore, is able to encode over 2 billion different values, including all the values that can be represented in UTF-8 and UTF-16. UTF-32 is useful for representing writing systems with numerous ideograms, such as Chinese, Japanese, and Korean. For representing ASCII, however, UTF-32 will always require four times as many bytes as UTF-8 would.
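A final Python sketch showing that fixed width, using the big-endian variant:

    # In UTF-32, every symbol occupies exactly four bytes, ASCII included.
    for symbol in ["A", "क", "中", "𝄞"]:
        print(symbol, symbol.encode("utf-32-be").hex())
    # A 00000041
    # क 00000915
    # 中 00004e2d
    # 𝄞 0001d11e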
