Method Unicode.normalize()


Method normalize

string normalize(string data, string method)

Description

Normalizes the given Unicode string according to the specified method.

The methods are:

NFC, NFD, NFKC and NFKD.

The methods are described in detail in the UAX #15 document, which can currently be found at http://www.unicode.org/unicode/reports/tr15/tr15-21.html

A short description:

C and D specify whether to decompose (D) composite characters into their parts, or to compose (C) those parts into single characters where possible.

K specifies whether to perform a compatibility conversion in addition to the canonical one. When K is present, compatibility transformations are performed as well as the canonical transformations.
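As a rough illustration of how the four methods differ, the sketch below applies each of them to one sample string. It uses Python's standard unicodedata module as a stand-in, not the Pike function documented here; note that unicodedata.normalize() takes the form name first and the string second, the reverse of the signature above. The helper name all_forms and the sample string are assumptions made for this example only.

```python
import unicodedata

# Sample input (an assumption for this sketch): precomposed 'Å' (U+00C5)
# followed by the 'fi' ligature (U+FB01).
sample = "\u00c5\ufb01"


def all_forms(text):
    """Return a dict mapping each form name to the normalized text."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


if __name__ == "__main__":
    for form, result in all_forms(sample).items():
        # Print the code points so that combining marks stay visible.
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in result)
        print(f"{form:4} -> {code_points}")
```

Only the D forms split 'Å' into a base letter plus a combining mark, and only the K forms replace the ligature with plain 'f' 'i'.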

In the following text, 'X' denotes the single character 'X', even if there is more than one character inside the quotation marks. The reason is that it is somewhat hard to describe Unicode in ISO-8859-1.

The Unicode Standard defines two equivalences between characters: canonical equivalence and compatibility equivalence. Canonical equivalence is a basic equivalency between characters or sequences of characters.

For example, 'Å' (U+00C5) and 'A' (U+0041) followed by '°' (U+030A, combining ring above) are canonically equivalent.
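The equivalence above can be checked with a small sketch. It again relies on Python's unicodedata module rather than the Pike function documented here, and the helper name canonically_equal is hypothetical. The idea is that two canonically equivalent strings become identical once both are brought to the same normalization form.

```python
import unicodedata

precomposed = "\u00c5"   # 'Å' as a single code point
decomposed = "A\u030a"   # 'A' followed by combining ring above


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # A plain comparison sees two different code point sequences.
    print(precomposed == decomposed)
    # After normalization the two spellings compare equal.
    print(canonically_equal(precomposed, decomposed))
```

Normalizing to NFC instead would work equally well for the comparison; what matters is that both sides use the same form.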

For round-trip compatibility with existing standards, Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character. These are given compatibility decompositions in the standard. Because the characters are visually distinguished, replacing a character by a compatibility equivalent may lose formatting information unless supplemented by markup or styling.

Examples of compatibility equivalences:

  • Font variants (thin, italic, extra wide characters, etc.)

  • Circled and squared characters

  • Superscripts and subscripts ('²' -> '2')

  • Fractions ('½' -> '1/2')

  • Other composed characters (the ligature 'fi' -> 'f' 'i', the unit symbol 'kg' -> 'k' 'g')
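A few of the compatibility equivalences listed above can be observed with the following sketch, which once more uses Python's unicodedata module as a stand-in for the Pike function. The table of examples and the helper name compat_fold are assumptions for illustration. One detail worth noting: the fraction decomposes to '1', U+2044 (fraction slash), '2', not to an ASCII '/'.

```python
import unicodedata

# Each key is a single compatibility character; each value is the text
# that NFKC is expected to produce for it.
examples = {
    "\u00b2": "2",          # superscript two
    "\u00bd": "1\u20442",   # vulgar fraction one half (uses fraction slash)
    "\ufb01": "fi",         # 'fi' ligature
    "\u338f": "kg",         # square 'kg' unit symbol
    "\u2460": "1",          # circled digit one
}


def compat_fold(text):
    """Apply compatibility (K) normalization with composition."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for source, expected in examples.items():
        folded = compat_fold(source)
        # The canonical-only form leaves these characters untouched.
        unchanged = unicodedata.normalize("NFC", source) == source
        print(f"U+{ord(source):04X} -> {folded!r} "
              f"(matches: {folded == expected}, NFC unchanged: {unchanged})")
```

Because the canonical forms leave these characters alone, the formatting information they carry is lost only when one of the K methods is requested.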