22. Charset

Module Charset

Description

The Charset module supports a wide variety of character sets and is flexible about the character set names it accepts. Character case is ignored, as are the most common non-alphanumeric characters appearing in character set names. E.g. "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.
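As a small illustration of the name handling described above, the following sketch decodes the same bytes under two spellings of one charset name and compares the results. It assumes a Pike version matching this reference (module named Charset); the helper name same_result is made up for this example.

```pike
// Hypothetical helper: decode the same bytes with two differently
// spelled charset names and report whether the results agree.
int(0..1) same_result(string name_a, string name_b, string bytes)
{
  string a = Charset.decoder(name_a)->feed(bytes)->drain();
  string b = Charset.decoder(name_b)->feed(bytes)->drain();
  return a == b;
}

int main()
{
  // "caf" followed by byte 0xE9 (e-acute in ISO-8859-1).
  string bytes = "caf" "\xe9";
  write("%d\n", same_result("iso-8859-1", "ISO_8859_1", bytes));
  return 0;
}
```

If both spellings resolve to the same codec, as the description says they should, the comparison is true.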

First of all, the Charset module can handle the following encodings of Unicode:

  • utf7
  • utf8
  • utf16
  • utf16be
  • utf16le
  • utf32
  • utf32be
  • utf32le
  • utf75
  • utf7½

    UTF encodings

  • shiftjis
  • euc-kr
  • euc-cn
  • euc-jp

Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the desired codec:

  • 037
  • 038
  • 273
  • 274
  • 275
  • 277
  • 278
  • 280
  • 281
  • 284
  • 285
  • 290
  • 297
  • 367
  • 420
  • 423
  • 424
  • 437
  • 500
  • 819
  • 850
  • 851
  • 852
  • 855
  • 857
  • 860
  • 861
  • 862
  • 863
  • 864
  • 865
  • 866
  • 868
  • 869
  • 870
  • 871
  • 880
  • 891
  • 903
  • 904
  • 905
  • 918
  • 932
  • 936
  • 950
  • 1026

    These may be prefixed with "cp", "ibm" or "ms".

  • 1250
  • 1251
  • 1252
  • 1253
  • 1254
  • 1255
  • 1256
  • 1257
  • 1258

    These may be prefixed with "cp", "ibm", "ms" or "windows".

  • mysql-latin1

    The default charset in MySQL, similar to cp1252.

+359 more.

Note

In Pike 7.8 and earlier this module was named Locale.Charset.


Method decode_error

void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)

Description

Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).


Method decoder

Decoder decoder(string|zero name)

Description

Returns a charset decoder object.

Parameter name

The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.

Throws

If the asked-for name was not supported, an error is thrown.
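A minimal sketch of typical decoder use, including the error thrown for an unsupported name. The helper try_decode and the charset name "no-such-charset" are made up for illustration; note that the catch block below also catches any other error raised while decoding, not only the unsupported-name error.

```pike
// Hypothetical helper: decode raw bytes in the named charset into a
// Pike string. Returns 0 and logs a message if anything is thrown,
// for instance because the charset name is not supported.
string|zero try_decode(string charset_name, string bytes)
{
  mixed err = catch {
    return Charset.decoder(charset_name)->feed(bytes)->drain();
  };
  werror("Could not decode as %O: %s", charset_name, describe_error(err));
  return 0;
}

int main()
{
  // "na", then the UTF-8 bytes C3 AF (i with diaeresis), then "ve".
  string bytes = "na" "\xc3\xaf" "ve";
  string|zero text = try_decode("utf-8", bytes);
  if (text) write("%d characters\n", sizeof(text));

  // A name that is presumably not supported by any build.
  try_decode("no-such-charset", bytes);
  return 0;
}
```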


Method decoder_from_mib

Decoder decoder_from_mib(int mib)

Description

Returns a decoder for the encoding scheme denoted by MIB mib.
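The sketch below looks a decoder up by number instead of by name. It rests on two assumptions that the text above does not confirm: that mib is an IANA MIBenum value, and that 4 (the MIBenum registered for ISO-8859-1) is known to this build.

```pike
int main()
{
  // Assumption: 4 is the IANA MIBenum for ISO-8859-1.
  Charset.Decoder dec = Charset.decoder_from_mib(4);
  if (!dec) {
    werror("No decoder found for MIB 4.\n");
    return 1;
  }

  // "caf" followed by byte 0xE9.
  write("%O\n", dec->feed("caf" "\xe9")->drain());
  return 0;
}
```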


Method encode_error

void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)

Description

Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).


Method encoder

Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)

Description

Returns a charset encoder object.

Parameter name

The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.

Parameter replacement

The string to use for characters that cannot be represented in the charset. It's used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.

Parameter repcb

A function to call for every character that cannot be represented in the charset. If specified it's called with one argument - a string containing the character in question. If it returns a string then that one will replace the character in the output. If it returns something else then the replacement argument will be used to decide what to do.

Throws

If the asked-for name was not supported, an error is thrown.
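The interplay between replacement and repcb can be sketched as follows. The names euro_fallback and encode_latin1 are hypothetical; the callback handles one character itself and returns zero for everything else, so that the replacement string takes over.

```pike
// Hypothetical callback: spell out the euro sign and defer every
// other unencodable character to the replacement string by returning 0.
string|zero euro_fallback(string ch)
{
  return ch[0] == 0x20ac ? "EUR" : 0;
}

// Hypothetical helper: encode to ISO-8859-1 with "?" as the replacement.
string encode_latin1(string text)
{
  return Charset.encoder("iso-8859-1", "?", euro_fallback)
    ->feed(text)->drain();
}

int main()
{
  // U+20AC (euro sign) and U+2603 (snowman) are not in ISO-8859-1.
  string text = sprintf("10 %c and a %c", 0x20ac, 0x2603);
  write("%O\n", encode_latin1(text));
  return 0;
}
```

Going by the parameter descriptions above, the euro sign should come out as "EUR" and the snowman as "?". Omitting both replacement and repcb would instead make the encoder throw an error on the first unencodable character.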


Method encoder_from_mib

Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)

Description

Returns an encoder for the encoding scheme denoted by MIB mib.


Method normalize

string|zero normalize(string|zero in)

Description

All character set names are normalized through this function before being compared.
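A short sketch for inspecting what normalize does to a few spellings. The exact normalized form is not specified here, so the example prints it rather than assuming it; same_charset_name is a hypothetical helper.

```pike
// Hypothetical helper: two names are treated as the same charset
// name if they normalize to the same string.
int(0..1) same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}

int main()
{
  foreach (({ "ISO_8859-1", "iso-8859-1", "Latin1", "UTF-8" }), string name)
    write("%-12s -> %O\n", name, Charset.normalize(name));

  write("%d\n", same_charset_name("UTF-8", "utf8"));
  return 0;
}
```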


Method set_decoder

void set_decoder(string name, program decoder)

Description

Adds a custom-defined character set decoder. The name is normalized through the use of normalize.
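A rough sketch of registering a custom decoder. It assumes that the registered program is instantiated without arguments and only needs the feed, drain and clear methods of Charset.Decoder; neither point is stated above. The class StrictAsciiDecoder and the name "strict-ascii-demo" are invented for this example.

```pike
// Hypothetical decoder: passes 7-bit bytes through and maps every
// other byte to U+FFFD (the Unicode replacement character).
class StrictAsciiDecoder
{
  inherit Charset.Decoder;

  protected string buf = "";

  this_program feed(string s)
  {
    foreach (s; int i; int c)
      buf += sprintf("%c", c < 0x80 ? c : 0xfffd);
    return this;
  }

  string drain()
  {
    string res = buf;
    buf = "";
    return res;
  }

  this_program clear()
  {
    buf = "";
    return this;
  }
}

int main()
{
  Charset.set_decoder("strict-ascii-demo", StrictAsciiDecoder);

  // The lookup name is normalized too, so a different spelling
  // of the registered name is used here on purpose.
  string text = Charset.decoder("Strict_ASCII_Demo")
    ->feed("ok" "\xff")->drain();
  write("%O\n", text);
  return 0;
}
```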


Method set_encoder

void set_encoder(string name, program encoder)

Description

Adds a custom-defined character set encoder. The name is normalized through the use of normalize.

Class Charset.CharsetGenericError

Description

Base class for errors thrown by the Charset module.


Inherit Generic

inherit Error.Generic : Generic

Class Charset.DecodeError

Description

Error thrown when decode fails (and no replacement char or replacement callback has been registered).

FIXME

This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.


Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError


Variable charset

string Charset.DecodeError.charset

Description

The decoding charset, typically as known to Charset.decoder.

Note

Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).


Variable err_pos

int Charset.DecodeError.err_pos

Description

The failing position in err_str.


Variable err_str

string Charset.DecodeError.err_str

Description

The string that failed to be decoded.

Class Charset.Decoder

Description

Virtual base class for charset decoders.

Example

string win1252_to_string( string data )
{
  return Charset.decoder("windows-1252")->feed( data )->drain();
}


Variable charset

string Charset.Decoder.charset

Description

Name of the charset - giving this name to decoder returns an instance of the same class as this object.

Note

This is not necessarily the same name that was actually given to decoder to produce this object.


Method clear

this_program clear()

Description

Clear buffers, and reset all state.

Returns

Returns the current object to allow for chaining of calls.


Method drain

string drain()

Description

Get the decoded data, and reset buffers.

Returns

Returns the decoded string.


Method feed

this_program feed(string s)

Description

Feeds a string to the decoder.

Parameter s

String to be decoded.

Returns

Returns the current object, to allow for chaining of calls.
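Because feed and drain are separate calls, one decoder object can be reused across several chunks of input. The sketch below assumes that the decoder keeps an incomplete multi-byte sequence buffered between feed calls, which the descriptions above do not state; decode_chunks is a hypothetical helper.

```pike
// Hypothetical helper: decode a sequence of chunks with one decoder
// object and concatenate whatever each drain() call returns.
string decode_chunks(string charset_name, array(string) chunks)
{
  Charset.Decoder dec = Charset.decoder(charset_name);
  String.Buffer out = String.Buffer();
  foreach (chunks, string chunk)
    out->add(dec->feed(chunk)->drain());
  return out->get();
}

int main()
{
  // The two-byte UTF-8 sequence C3 AF is split across the chunks.
  array(string) chunks = ({ "na" "\xc3", "\xaf" "ve" });
  write("%O\n", decode_chunks("utf-8", chunks));
  return 0;
}
```

Calling clear on the decoder between unrelated inputs discards any such buffered state.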

Class Charset.EncodeError

Description

Error thrown when encode fails (and no replacement char or replacement callback has been registered).

FIXME

This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.


Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError


Variable charset

string Charset.EncodeError.charset

Description

The encoding charset, typically as known to Charset.encoder.

Note

Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).


Variable err_pos

int Charset.EncodeError.err_pos

Description

The failing position in err_str.


Variable err_str

string Charset.EncodeError.err_str

Description

The string that failed to be encoded.

Class Charset.Encoder

Description

Virtual base class for charset encoders.


Inherit Decoder

inherit Decoder : Decoder

Description

An encoder only differs from a decoder in that it has an extra function.


Variable charset

string Charset.Encoder.charset

Description

Name of the charset - giving this name to encoder returns an instance of the same class as this one.

Note

This is not necessarily the same name that was actually given to encoder to produce this object.


Method set_replacement_callback

this_program set_replacement_callback(function(string:string) rc)

Description

Change the replacement callback function.

Parameter rc

Function that is called to encode characters outside the current character encoding.

Returns

Returns the current object to allow for chaining of calls.
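A minimal sketch of installing a replacement callback after the encoder has been created. The policy shown (XML-style numeric character references) and the names numeric_reference and to_ascii are only illustrative.

```pike
// Hypothetical policy: replace each unencodable character with an
// XML-style numeric character reference.
string numeric_reference(string ch)
{
  return sprintf("&#%d;", ch[0]);
}

// Hypothetical helper: encode to US-ASCII using the policy above.
string to_ascii(string text)
{
  Charset.Encoder enc = Charset.encoder("us-ascii");
  enc->set_replacement_callback(numeric_reference);
  return enc->feed(text)->drain();
}

int main()
{
  // "caf" followed by U+00E9, which US-ASCII cannot represent.
  write("%s\n", to_ascii(sprintf("caf%c", 0xe9)));
  return 0;
}
```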

Module Charset.Tables

Module Charset.Tables.iso88591

Description

Codec for the ISO-8859-1 character encoding.