


UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Rob Pike and Ken Thompson. It represents each Unicode character as a sequence of one or more bytes, and so covers the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit mail systems.

It uses 1 to 4 bytes per character, depending on the character's Unicode code point. For example, only one UTF-8 byte is needed to encode each of the 128 US-ASCII characters in the Unicode range U+0000 to U+007F.
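The variable length is easy to observe in practice. The following is a minimal sketch that relies on Python's built-in UTF-8 codec; the helper name `utf8_length` and the sample characters are chosen here for illustration only.

```python
# Hypothetical helper: count the bytes that Python's built-in UTF-8 codec
# produces for a single character.
def utf8_length(ch: str) -> int:
    return len(ch.encode("utf-8"))

# Sample characters (an arbitrary selection): one from each length class,
# plus the Hebrew letter alef.
samples = ["A", "\u00e9", "\u05d0", "\u20ac", "\U0001d11e"]
lengths = [utf8_length(ch) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} needs {n} byte(s)")
```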

While it may seem inefficient to represent Unicode characters with as many as 4 bytes, UTF-8 is a superset of ASCII, which allows legacy systems built around ASCII to transmit it. Additionally, data compression can still be performed independently of the use of UTF-8.

The IETF (Internet Engineering Task Force) requires all Internet protocols to identify the encoding used for character data, and to support UTF-8 as at least one of the available encodings.

1 Description

UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).

In summary, a Unicode character's bits are divided into several groups, which are then divided among the lower bit positions inside the UTF-8 bytes.

Characters with values below 128 (decimal) are encoded with a single byte that contains their value: these correspond exactly to the 128 7-bit ASCII characters.

In other cases, up to 4 bytes are required. The uppermost bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters, particularly the characters below 32 (decimal), traditionally called control characters (e.g. carriage return).
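This property can be checked byte by byte. The sketch below assumes Python's built-in codec and uses a hypothetical helper, `is_ascii_byte`, to test the uppermost bit; the sample string is arbitrary.

```python
# Hypothetical helper: a byte belongs to the 7-bit ASCII range exactly when
# its uppermost bit is 0.
def is_ascii_byte(b: int) -> bool:
    return (b & 0x80) == 0

# 'a', the Hebrew letter alef (two bytes), then a carriage return.
encoded = "a\u05d0\r".encode("utf-8")
flags = [is_ascii_byte(b) for b in encoded]

for b, flag in zip(encoded, flags):
    print(f"0x{b:02X} {b:08b} ascii={flag}")
```

Because no byte of a multi-byte sequence has its top bit clear, a program scanning for an ASCII control character such as carriage return cannot match it by accident inside another character.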

Code range (hex)   UTF-16 (binary)                      UTF-8 (binary)                       Notes
000000 - 00007F    00000000 0xxxxxxx                    0xxxxxxx                             ASCII equivalence range; byte begins with zero
000080 - 0007FF    00000xxx xxxxxxxx                    110xxxxx 10xxxxxx                    first byte begins with 110, the following byte begins with 10
000800 - 00FFFF    xxxxxxxx xxxxxxxx                    1110xxxx 10xxxxxx 10xxxxxx           first byte begins with 1110, the following bytes begin with 10
010000 - 10FFFF    110110xx xxxxxxxx 110111xx xxxxxxxx  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8
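
The rows of the table translate fairly directly into shifts and masks. What follows is a minimal sketch rather than a reference implementation: the function names `encode_utf8` and `utf16_surrogates` are hypothetical, and the rejection of the surrogate range U+D800 to U+DFFF is an added assumption taken from RFC 3629 rather than from the table itself.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by following the UTF-8 column of the table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside 0x00-0x10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogate values are not encoded (RFC 3629).
        raise ValueError("surrogate code points are not encoded")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    v = cp - 0x10000                      # the offset named in the table
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


# Cross-check against Python's built-in codec for one sample per row.
for cp in (0x41, 0x5D0, 0x20AC, 0x1D11E):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```

The subtraction in `utf16_surrogates` is the reason the last row's UTF-16 bit pattern differs from the UTF-8 one, as the Notes column points out.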

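For example, the character alef (א), which is Unicode U+05D0, is encoded into UTF-8 in this way: it falls in the range 0x0080 to 0x07FF, so the table calls for two bytes of the form 110xxxxx 10xxxxxx. Hexadecimal 0x05D0 is binary 101 1101 0000; placing these 11 bits in order into the positions marked x gives 11010111 10010000, that is, the two bytes 0xD7 0x90.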
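
The same calculation for alef can be sketched with bit strings, which keeps each step visible. The variable names are illustrative, and the final comparison against Python's built-in codec is only a sanity check.

```python
cp = 0x05D0                      # alef

# U+05D0 lies in 0x0080-0x07FF, so the two-byte form 110xxxxx 10xxxxxx
# applies and 11 payload bits are needed.
bits = format(cp, "011b")        # zero-padded to 11 binary digits

first = "110" + bits[:5]         # leading byte takes the top 5 bits
second = "10" + bits[5:]         # continuation byte takes the low 6 bits

encoded = bytes([int(first, 2), int(second, 2)])
print(first, second, encoded.hex())

# Sanity check against the built-in codec.
assert encoded == "\u05d0".encode("utf-8")
```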

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)
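
The figures 128 and 1920 follow from the code ranges in the table. A short sketch of the arithmetic, with an illustrative dictionary name:

```python
# Code point range covered by each encoded length, taken from the table.
ranges = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}

# Number of code points in each range (both ends inclusive).
counts = {n: hi - lo + 1 for n, (lo, hi) in ranges.items()}

for n, count in counts.items():
    print(f"{n} byte(s): {count} code points")
```

Note that the three-byte count is the raw size of the range 0x0800 to 0xFFFF; it makes no attempt to exclude values that are not assigned to characters.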

The original design of UTF-8 could use sequences of up to six bytes and cover the whole range 0x00-0x7FFFFFFF (31 bits), but in November 2003 RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, 0x00-0x10FFFF. Before this, only the bytes 0xFE and 0xFF did not occur in UTF-8 encoded text. After this limit was introduced, the number of byte values that never appear in a UTF-8 stream increased to 13: 0xC0, 0xC1, and 0xF5-0xFF. Even though this new definition limits the available encoding range severely, it helps with the problem of overlong sequences (different ways of encoding the same character, which can be a security risk): a two-byte overlong sequence must begin with 0xC0 or 0xC1, which are never used, so it cannot be a valid sequence. Overlong three- and four-byte sequences contain none of the unused byte values, so a decoder must still reject them explicitly.
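
The unused byte values and the treatment of overlong forms can be probed with a strict decoder. The sketch below assumes Python's built-in UTF-8 codec as that decoder; `is_rejected` is a hypothetical helper, and the overlong samples are hand-built encodings of the slash character (U+002F).

```python
# The 13 byte values that never occur in UTF-8 restricted to 0x10FFFF.
unused = [0xC0, 0xC1] + list(range(0xF5, 0x100))

# Hypothetical helper: True when Python's strict decoder refuses the input.
def is_rejected(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# Overlong two-byte form of '/': it starts with the unused byte 0xC0.
overlong_two = b"\xc0\xaf"
# Overlong three-byte form of '/': it contains no unused byte value, yet a
# strict decoder is still expected to refuse it.
overlong_three = b"\xe0\x80\xaf"

print(len(unused), is_rejected(overlong_two), is_rejected(overlong_three))
```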


