22. Charset

Module Charset

Description

The Charset module supports a wide variety of character sets and is flexible about the names it accepts. Character case is ignored, as are the most common non-alphanumeric characters appearing in character set names. For example, "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.

First of all, the Charset module can handle the following encodings of Unicode:

  • utf7
  • utf8
  • utf16
  • utf16be
  • utf16le
  • utf32
  • utf32be
  • utf32le
  • utf75
  • utf7½

    UTF encodings

  • shiftjis
  • euc-kr
  • euc-cn
  • euc-jp

Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the desired codec:

  • 037
  • 038
  • 273
  • 274
  • 275
  • 277
  • 278
  • 280
  • 281
  • 284
  • 285
  • 290
  • 297
  • 367
  • 420
  • 423
  • 424
  • 437
  • 500
  • 819
  • 850
  • 851
  • 852
  • 855
  • 857
  • 860
  • 861
  • 862
  • 863
  • 864
  • 865
  • 866
  • 868
  • 869
  • 870
  • 871
  • 880
  • 891
  • 903
  • 904
  • 905
  • 918
  • 932
  • 936
  • 950
  • 1026

    These may be prefixed with "cp", "ibm" or "ms".

  • 1250
  • 1251
  • 1252
  • 1253
  • 1254
  • 1255
  • 1256
  • 1257
  • 1258

    These may be prefixed with "cp", "ibm", "ms" or "windows".

  • mysql-latin1

    The default charset in MySQL, similar to cp1252.

+359 more.

Note

In Pike 7.8 and earlier this module was named Locale.Charset.


Method decode_error

void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)

Description

Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
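
As an illustration of the call above, the following is a minimal sketch that raises the error and then reads back the fields documented for DecodeError. The helper name describe_decode_failure and the charset name "x-example" are made up for this example and are not part of the module.

```pike
// Hypothetical helper: reports a bad byte in a made-up charset
// named "x-example" by calling Charset.decode_error.
string describe_decode_failure(string data, int pos)
{
  mixed err = catch {
    // The reason is formatted as sprintf(reason, @args).
    Charset.decode_error(data, pos, "x-example",
                         "Unexpected byte %d.\n", data[pos]);
  };
  // The thrown object carries the documented DecodeError fields.
  if (objectp(err))
    return sprintf("%s failed at position %d", err->charset, err->err_pos);
  return "no error";
}
```

Calling describe_decode_failure("abc\377", 3) should then yield a short description naming the charset and the failing position.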


Method decoder

Decoder decoder(string|zero name)

Description

Returns a charset decoder object.

Parameter name

The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.

Throws

If the requested name is not supported, an error is thrown.
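
A minimal sketch of typical use, assuming the input is a byte string in UTF-8. The helper names utf8_bytes_to_string and is_supported_charset are hypothetical and exist only for this example.

```pike
// Hypothetical helper: decode a UTF-8 byte string into a Pike string.
string utf8_bytes_to_string(string data)
{
  return Charset.decoder("utf-8")->feed(data)->drain();
}

// Hypothetical helper: since decoder() throws for unsupported names,
// catching that error gives a simple support check.
int is_supported_charset(string name)
{
  mixed err = catch { Charset.decoder(name); };
  return !err;
}
```

The second helper relies only on the documented behaviour that an unsupported name causes an error to be thrown.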


Method decoder_from_mib

Decoder decoder_from_mib(int mib)

Description

Returns a decoder for the encoding scheme denoted by MIB mib.
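
A short sketch of a lookup by number, assuming that mib follows the IANA character set registry, where MIBenum 4 denotes ISO_8859-1:1987. The helper name decode_by_mib is hypothetical, and the assumption that this module's tables map 4 to ISO-8859-1 should be checked against the installed version.

```pike
// Hypothetical helper: decode data with the charset denoted by a MIB number.
// Assumption: 4 is the IANA MIBenum for ISO_8859-1:1987.
string decode_by_mib(int mib, string data)
{
  return Charset.decoder_from_mib(mib)->feed(data)->drain();
}
```

With that assumption, decode_by_mib(4, data) would behave like a decoder obtained through Charset.decoder("iso-8859-1").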


Method encode_error

void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)

Description

Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).


Method encoder

Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)

Description

Returns a charset encoder object.

Parameter name

The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.

Parameter replacement

The string to use for characters that cannot be represented in the charset. It's used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.

Parameter repcb

A function to call for every character that cannot be represented in the charset. If specified it's called with one argument - a string containing the character in question. If it returns a string then that one will replace the character in the output. If it returns something else then the replacement argument will be used to decide what to do.

Throws

If the requested name is not supported, an error is thrown.
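
The interplay of replacement and repcb can be sketched as follows. Both helper names are hypothetical, and the numeric-entity format produced by the callback is just one possible choice, not something the module prescribes.

```pike
// Hypothetical helper: unrepresentable characters become "?".
string to_latin1_lossy(string text)
{
  return Charset.encoder("iso-8859-1", "?")->feed(text)->drain();
}

// Hypothetical helper: unrepresentable characters become numeric
// entities via repcb; "?" remains the fallback if the callback
// were to return zero.
string to_latin1_with_entities(string text)
{
  return Charset.encoder("iso-8859-1", "?",
                         lambda(string ch) {
                           return sprintf("&#%d;", ch[0]);
                         })->feed(text)->drain();
}
```

Without a replacement string or a callback, the same input would cause an error to be thrown instead.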


Method encoder_from_mib

Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)

Description

Returns an encoder for the encoding schema denoted by MIB mib.


Method normalize

string|zero normalize(string|zero in)

Description

All character set names are normalized through this function before being compared.
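
A minimal sketch of how normalization can be used to compare two spellings of a charset name. The helper name same_charset_name is hypothetical, and the sketch deliberately makes no claim about the exact form of the normalized string.

```pike
// Hypothetical helper: true if both names normalize to the same string.
int same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}
```

For instance, same_charset_name("iso-8859-1", "ISO_8859_1") is expected to be true, since case and common punctuation in charset names are ignored.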


Method set_decoder

void set_decoder(string name, program decoder)

Description

Adds a custom-defined character set decoder. The name is normalized through the use of normalize.


Method set_encoder

void set_encoder(string name, program encoder)

Description

Adds a custom-defined character set encoder. The name is normalized through the use of normalize.

Class Charset.CharsetGenericError

Description

Base class for errors thrown by the Charset module.


Inherit Generic

inherit Error.Generic : Generic

Class Charset.DecodeError

Description

Error thrown when decode fails (and no replacement char or replacement callback has been registered).

FIXME

This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.


Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError


Variable charset

string Charset.DecodeError.charset

Description

The decoding charset, typically as known to Charset.decoder.

Note

Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).


Variable err_pos

int Charset.DecodeError.err_pos

Description

The failing position in err_str.


Variable err_str

string Charset.DecodeError.err_str

Description

The string that failed to be decoded.

Class Charset.Decoder

Description

Virtual base class for charset decoders.

Example

string win1252_to_string(string data)
{
  return Charset.decoder("windows-1252")->feed(data)->drain();
}


Variable charset

string Charset.Decoder.charset

Description

Name of the charset - giving this name to decoder returns an instance of the same class as this object.

Note

This is not necessarily the same name that was actually given to decoder to produce this object.


Method clear

this_program clear()

Description

Clear buffers, and reset all state.

Returns

Returns the current object to allow for chaining of calls.


Method drain

string drain()

Description

Get the decoded data, and reset buffers.

Returns

Returns the decoded string.


Method feed

this_program feed(string s)

Description

Feeds a string to the decoder.

Parameter s

String to be decoded.

Returns

Returns the current object, to allow for chaining of calls.
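
Because feed returns the decoder itself, input can be supplied piecewise and collected once with drain. The following is a minimal sketch with a hypothetical helper name, decode_chunks. It assumes each chunk ends on a character boundary, since the text above does not state how a multibyte sequence split across calls is handled.

```pike
// Hypothetical helper: feed several chunks, then drain once.
// Assumption: no chunk ends in the middle of a multibyte sequence.
string decode_chunks(string charset, array(string) chunks)
{
  Charset.Decoder dec = Charset.decoder(charset);
  foreach (chunks, string chunk)
    dec->feed(chunk);
  return dec->drain();
}
```

This pattern suits data that arrives in pieces, for example from a file or socket read loop.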

Class Charset.EncodeError

Description

Error thrown when encode fails (and no replacement char or replacement callback has been registered).

FIXME

This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.


Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError


Variable charset

string Charset.EncodeError.charset

Description

The encoding charset, typically as known to Charset.encoder.

Note

Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).


Variable err_pos

int Charset.EncodeError.err_pos

Description

The failing position in err_str.


Variable err_str

string Charset.EncodeError.err_str

Description

The string that failed to be encoded.

Class Charset.Encoder

Description

Virtual base class for charset encoders.


Inherit Decoder

inherit Decoder : Decoder

Description

An encoder only differs from a decoder in that it has an extra function.


Variable charset

string Charset.Encoder.charset

Description

Name of the charset - giving this name to encoder returns an instance of the same class as this one.

Note

This is not necessarily the same name that was actually given to encoder to produce this object.


Method set_replacement_callback

this_program set_replacement_callback(function(string:string) rc)

Description

Change the replacement callback function.

Parameter rc

Function that is called to encode characters outside the current character encoding.

Returns

Returns the current object to allow for chaining of calls.
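
A minimal sketch of installing a callback on an existing encoder. The helper name encode_ascii_escaped and the "<U+XXXX>" output format are assumptions made for this example only.

```pike
// Hypothetical helper: encode to US-ASCII, writing every character
// outside the charset as a "<U+XXXX>" marker.
string encode_ascii_escaped(string text)
{
  Charset.Encoder enc = Charset.encoder("us-ascii");
  enc->set_replacement_callback(lambda(string ch) {
      // ch holds the single character that could not be encoded.
      return sprintf("<U+%04X>", ch[0]);
    });
  return enc->feed(text)->drain();
}
```

Setting the callback after construction is useful when the encoder object is created in one place and the replacement policy is decided elsewhere.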

Module Charset.Tables

Module Charset.Tables.iso88591

Description

Codec for the ISO-8859-1 character encoding.