23. Charset
Module Charset
- Description
The Charset module supports a wide variety of character sets and is flexible about the names it accepts: character case is ignored, as are the most common non-alphanumeric characters that appear in character set names. E.g. "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.
First of all, the Charset module is capable of handling the following encodings of Unicode:
- ucs2
- ucs2be
- ucs2le
- ucs4
- ucs4be
- ucs4le
Universal Coded Character Set encodings.
- utf7
- utf8
- utf16
- utf16be
- utf16le
- utf32
- utf32be
- utf32le
- utf75
- utf7½
Unicode Transformation Format (aka UTF) encodings.
- shiftjis
- euc-kr
- euc-cn
- euc-jp
Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:
- 037
- 038
- 273
- 274
- 275
- 277
- 278
- 280
- 281
- 284
- 285
- 290
- 297
- 367
- 420
- 423
- 424
- 437
- 500
- 819
- 850
- 851
- 852
- 855
- 857
- 860
- 861
- 862
- 863
- 864
- 865
- 866
- 868
- 869
- 870
- 871
- 880
- 891
- 903
- 904
- 905
- 918
- 932
- 936
- 950
- 1026
These may be prefixed with "cp", "ibm" or "ms".
- 1250
- 1251
- 1252
- 1253
- 1254
- 1255
- 1256
- 1257
- 1258
These may be prefixed with "cp", "ibm", "ms" or "windows".
- mysql-latin1
The default charset in MySQL, similar to cp1252.
+359 more.
- Note
In Pike 7.8 and earlier this module was named Locale.Charset.
- Method
decode_error
void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
- Description
Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
- Method
decoder
Decoder decoder(string|zero name)
- Description
Returns a charset decoder object.
- Parameter
name The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
- Throws
If the asked-for name was not supported, an error is thrown.
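A minimal sketch of decoder() in use, assuming Pike 8.0 or later, where the module is named Charset. The sample bytes and the deliberately bogus charset name are made up for illustration.

```pike
int main()
{
  // "\303\245" is the two-byte UTF-8 encoding of U+00E5.
  string raw = "\303\245";

  // feed() returns the decoder itself, so the calls can be chained.
  string text = Charset.decoder("utf-8")->feed(raw)->drain();
  write("%d byte(s) in, %d character(s) out\n", sizeof(raw), sizeof(text));

  // Hypothetical, unsupported name: decoder() is documented to throw.
  mixed err = catch { Charset.decoder("no-such-charset"); };
  if (err)
    write("unsupported name: %s", describe_error(err));
  return 0;
}
```

Wrapping the call in catch is only needed when the charset name comes from untrusted input, such as a MIME header.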
- Method
decoder_from_mib
Decoder decoder_from_mib(int mib)
- Description
Returns a decoder for the encoding scheme denoted by MIB mib.
- Method
encode_error
void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
- Description
Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
- Method
encoder
Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)
- Description
Returns a charset encoder object.
- Parameter
name The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
- Parameter
replacement The string to use for characters that cannot be represented in the charset. It's used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.
- Parameter
repcb A function to call for every character that cannot be represented in the charset. If specified it's called with one argument - a string containing the character in question. If it returns a string then that one will replace the character in the output. If it returns something else then the replacement argument will be used to decide what to do.
- Throws
If the asked-for name was not supported, an error is thrown.
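The interplay between replacement and repcb described above can be sketched as follows. The callback spell_out and the sample text are hypothetical; the sketch assumes the error for an unencodable character is raised while feeding or draining, which is why both calls sit inside the catch.

```pike
int main()
{
  // U+20AC (euro sign) and U+2603 (snowman) do not exist in ISO-8859-1.
  string text = sprintf("10 %c and a %c", 0x20ac, 0x2603);

  // Hypothetical callback: spell out the euro sign, give up on the rest.
  // Returning 0 (not a string) leaves the decision to the replacement.
  function spell_out = lambda(string ch) {
    return ch[0] == 0x20ac ? "EUR" : 0;
  };

  // Replacement string only.
  write("%s\n", Charset.encoder("iso-8859-1", "?")->feed(text)->drain());

  // Callback first, replacement string as the fallback.
  write("%s\n",
        Charset.encoder("iso-8859-1", "?", spell_out)->feed(text)->drain());

  // Neither given: encoding is documented to fail with an error.
  mixed err = catch { Charset.encoder("iso-8859-1")->feed(text)->drain(); };
  if (err)
    write("no replacement: %s", describe_error(err));
  return 0;
}
```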
- Method
encoder_from_mib
Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)
- Description
Returns an encoder for the encoding scheme denoted by MIB mib.
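As a rough illustration of the MIB-based lookups, the sketch below round-trips a string through UTF-8. It assumes that the module's table contains MIBenum 106, the value IANA registers for UTF-8; the sample text is made up.

```pike
int main()
{
  // 106 is the IANA MIBenum value registered for UTF-8 (assumed to be
  // known to the module).
  int mib = 106;

  // Sample text with two non-ASCII characters (U+00F6 and U+00E5).
  string text = sprintf("sm%crg%cs", 0xf6, 0xe5);

  string bytes = Charset.encoder_from_mib(mib)->feed(text)->drain();
  string back = Charset.decoder_from_mib(mib)->feed(bytes)->drain();

  write("%d characters, %d bytes, round trip %s\n",
        sizeof(text), sizeof(bytes), back == text ? "ok" : "FAILED");
  return 0;
}
```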
- Method
normalize
string|zero normalize(string|zero in)
- Description
All character set names are normalized through this function before being compared.
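A small sketch of what normalization means in practice, using the two spellings from the module description. The exact normalized form is an implementation detail, so the code prints it rather than assuming a particular value.

```pike
int main()
{
  // Spelling variants of one charset name.
  array(string) names = ({ "iso-8859-1", "ISO_8859_1", "ISO-8859-1" });

  foreach (names, string name)
    write("%O -> %O\n", name, Charset.normalize(name));

  // Names that normalize to the same string select the same charset.
  if (Charset.normalize("iso-8859-1") == Charset.normalize("ISO_8859_1"))
    write("both spellings are equivalent\n");
  return 0;
}
```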
- Method
set_decoder
void set_decoder(string name, program decoder)
- Description
Adds a custom defined character set decoder. The name is normalized through the use of normalize.
- Method
set_encoder
void set_encoder(string name, program encoder)
- Description
Adds a custom defined character set encoder. The name is normalized through the use of normalize.
Class Charset.CharsetGenericError
- Description
Base class for errors thrown by the Charset module.
Class Charset.DecodeError
- Description
Error thrown when decode fails (and no replacement char or replacement callback has been registered).
- FIXME
This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
- Variable
charset
string Charset.DecodeError.charset
- Description
The decoding charset, typically as known to Charset.decoder.
- Note
Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
Class Charset.Decoder
- Description
Virtual base class for charset decoders.
Decoders take a stream of bytes and convert them to a (possibly wide) string of Unicode code points.
- Example
string win1252_to_string( string(8bit) data )
{
  return Charset.decoder("windows-1252")->feed( data )->drain();
}
- See also
decoder(), Encoder
- Variable
charset
string Charset.Decoder.charset
- Description
Canonical name of the charset - giving this name to decoder returns an instance of the same class as this object.
- Note
This is not necessarily the same name that was actually given to decoder to produce this object.
- Method
clear
this_program clear()
- Description
Clear buffers, and reset all state.
- Returns
Returns the current object to allow for chaining of calls.
- Method
drain
string drain()
- Description
Get the decoded data, and reset buffers.
- Returns
Returns the decoded string.
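The following sketch shows how feed(), drain() and clear() might be combined when input arrives in chunks. The chunks are made-up ISO-8859-1 data, and the sketch assumes that drain() returns an empty string when nothing is buffered.

```pike
int main()
{
  Charset.Decoder dec = Charset.decoder("iso-8859-1");

  // Feed the input in several chunks, then collect everything at once.
  foreach (({ "caf\351 ", "cr\350me" }), string chunk)
    dec->feed(chunk);
  string text = dec->drain();
  write("%d characters decoded\n", sizeof(text));

  // drain() resets the buffers, so nothing is left for a second call.
  write("left after drain: %d\n", sizeof(dec->drain()));

  // clear() discards pending data and state before the decoder is reused.
  dec->feed("discarded")->clear();
  write("left after clear: %d\n", sizeof(dec->drain()));
  return 0;
}
```

Reusing one decoder object this way avoids a charset lookup per chunk, at the cost of having to call clear() when a stream is abandoned halfway.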
Class Charset.EncodeError
- Description
Error thrown when encode fails (and no replacement char or replacement callback has been registered).
- FIXME
This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
- Variable
charset
string Charset.EncodeError.charset
- Description
The encoding charset, typically as known to Charset.encoder.
- Note
Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
Class Charset.Encoder
- Description
Virtual base class for charset encoders.
Encoders take a stream of Unicode code points and convert them to a string of 8-bit bytes.
- See also
encoder(), Decoder
- Inherit
Decoder
inherit Decoder : Decoder
- Description
An encoder differs from a decoder only in that it has an extra function, and in that feed() accepts wide strings while drain() returns only 8-bit strings.
- Variable
charset
string Charset.Encoder.charset
- Description
Canonical name of the charset - giving this name to encoder returns an instance of the same class as this one.
- Note
This is not necessarily the same name that was actually given to encoder to produce this object.
- Method
feed
this_program feed(string|String.Buffer s)
- Description
Similar to ::feed(), but accepts wide strings.
- Method
set_replacement_callback
this_program set_replacement_callback(function(string:string) rc)
- Description
Change the replacement callback function.
- Parameter
rc Function that is called to encode characters outside the current character encoding.
- Returns
Returns the current object to allow for chaining of calls.
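As an illustration, a callback can be installed after the encoder has been created. The sketch below uses a hypothetical callback that replaces each unencodable character with an HTML-style numeric character reference; the sample text is made up.

```pike
int main()
{
  // U+20AC (euro sign) cannot be represented in US-ASCII.
  string text = sprintf("price: 10 %c", 0x20ac);

  Charset.Encoder enc = Charset.encoder("us-ascii", "?");

  // Hypothetical callback: ch holds the offending character, and ch[0]
  // is its code point.
  enc->set_replacement_callback(lambda(string ch) {
      return sprintf("&#%d;", ch[0]);
    });

  write("%s\n", enc->feed(text)->drain());
  return 0;
}
```

Because set_replacement_callback() returns the encoder, the call can also be chained directly in front of feed().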
Module Charset.Tables
Module Charset.Tables.iso88591
- Description
Codec for the ISO-8859-1 character encoding.