22. Charset

Module Charset


The Charset module supports a wide variety of character sets and is flexible about the character set names it accepts. Character case is ignored, as are the most common non-alphanumeric characters that appear in character set names; for example, "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.
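As a minimal sketch of the name flexibility described above, the following decodes the same bytes using both spellings mentioned in the text and compares the results. The helper name decode_with and the sample bytes are illustrative assumptions, not part of the module.

```pike
// Hypothetical helper: decode data with the named charset in one step.
string decode_with(string charset_name, string data)
{
  return Charset.decoder(charset_name)->feed(data)->drain();
}

int main()
{
  // Illustrative ISO-8859-1 bytes; any 8-bit string would do.
  string latin1_bytes = "sm\xf6rg\xe5s";

  // Both spellings are expected to select the same codec.
  string a = decode_with("iso-8859-1", latin1_bytes);
  string b = decode_with("ISO_8859_1", latin1_bytes);

  write("Same result: %s\n", a == b ? "yes" : "no");
  return 0;
}
```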

First of all, the Charset module handles the following encodings of Unicode:

  • utf7
  • utf8
  • utf16
  • utf16be
  • utf16le
  • utf32
  • utf32be
  • utf32le
  • utf75
  • utf7½

    UTF encodings

  • shiftjis
  • euc-kr
  • euc-cn
  • euc-jp

Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:

  • 037
  • 038
  • 273
  • 274
  • 275
  • 277
  • 278
  • 280
  • 281
  • 284
  • 285
  • 290
  • 297
  • 367
  • 420
  • 423
  • 424
  • 437
  • 500
  • 819
  • 850
  • 851
  • 852
  • 855
  • 857
  • 860
  • 861
  • 862
  • 863
  • 864
  • 865
  • 866
  • 868
  • 869
  • 870
  • 871
  • 880
  • 891
  • 903
  • 904
  • 905
  • 918
  • 932
  • 936
  • 950
  • 1026

    These may be prefixed with "cp", "ibm" or "ms".

  • 1250
  • 1251
  • 1252
  • 1253
  • 1254
  • 1255
  • 1256
  • 1257
  • 1258

    These may be prefixed with "cp", "ibm", "ms" or "windows".

  • mysql-latin1

    The default charset in MySQL, similar to cp1252.

+359 more.
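The prefix rule for code pages can be sketched as follows: the same code page number is requested under different prefixes, and the decoded results are compared. The sample bytes are illustrative assumptions; only the prefixes listed above are used.

```pike
int main()
{
  // Illustrative byte strings in the respective code pages.
  string dos_bytes = "caf\x82";
  string win_bytes = "\x80 5";

  // Code page 437 under two of the prefixes listed above.
  string a = Charset.decoder("cp437")->feed(dos_bytes)->drain();
  string b = Charset.decoder("ibm437")->feed(dos_bytes)->drain();
  write("cp437 and ibm437 agree: %s\n", a == b ? "yes" : "no");

  // Code page 1252 additionally accepts the "windows" prefix.
  string c = Charset.decoder("cp1252")->feed(win_bytes)->drain();
  string d = Charset.decoder("windows-1252")->feed(win_bytes)->drain();
  write("cp1252 and windows-1252 agree: %s\n", c == d ? "yes" : "no");
  return 0;
}
```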


In Pike 7.8 and earlier this module was named Locale.Charset.

Method decode_error

void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)


Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).

Method decoder

Decoder decoder(string|zero name)


Returns a charset decoder object.

Parameter name

The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.


If the requested name is not supported, an error is thrown.
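A hedged usage sketch: the helper below (try_decode, a hypothetical name) wraps decoder in catch so that an unsupported charset name yields 0 instead of propagating the error. The charset name "no-such-charset" is assumed to be unsupported.

```pike
// Hypothetical helper: returns the decoded string, or 0 on failure.
string|zero try_decode(string charset_name, string data)
{
  string|zero result;
  mixed err = catch {
    result = Charset.decoder(charset_name)->feed(data)->drain();
  };
  if (err) {
    // describe_error() accepts both error arrays and error objects.
    werror("Could not decode as %O: %s", charset_name, describe_error(err));
    return 0;
  }
  return result;
}

int main()
{
  // The three bytes are the UTF-8 encoding of U+20AC.
  write("%O\n", try_decode("utf-8", "\xe2\x82\xac"));
  // Assumed to be an unknown name, so this should take the error path.
  write("%O\n", try_decode("no-such-charset", "abc"));
  return 0;
}
```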

Method decoder_from_mib

Decoder decoder_from_mib(int mib)


Returns a decoder for the encoding scheme denoted by the MIB (IANA MIBenum) value mib.
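As a sketch only: in the IANA character set registry, MIBenum 106 denotes UTF-8 and 4 denotes ISO-8859-1. The example assumes that decoder_from_mib accepts these two values; the helper name decode_by_mib and the zero check are illustrative precautions, not documented behaviour.

```pike
// Hypothetical helper: decode data using a charset selected by MIBenum.
string|zero decode_by_mib(int mib, string data)
{
  Charset.Decoder|zero dec = Charset.decoder_from_mib(mib);
  if (!dec) return 0;  // Precaution in case no decoder is available.
  return dec->feed(data)->drain();
}

int main()
{
  // 106 is the IANA MIBenum for UTF-8 (assumed to be supported here).
  write("%O\n", decode_by_mib(106, "\xe2\x82\xac"));
  // 4 is the IANA MIBenum for ISO-8859-1 (assumed to be supported here).
  write("%O\n", decode_by_mib(4, "sm\xf6rg\xe5s"));
  return 0;
}
```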

Method encode_error

void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)


Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).

Method encoder

Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)


Returns a charset encoder object.

Parameter name

The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.

Parameter replacement

The string to use for characters that cannot be represented in the charset. It's used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.

Parameter repcb

A function to call for every character that cannot be represented in the charset. If specified it's called with one argument - a string containing the character in question. If it returns a string then that one will replace the character in the output. If it returns something else then the replacement argument will be used to decide what to do.


If the requested name is not supported, an error is thrown.
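The interplay between replacement and repcb described above can be sketched like this. The callback handles one character itself and returns zero for everything else, so that the replacement string is used as the fallback. The function name to_latin1 and the substitution "EUR" are illustrative assumptions.

```pike
// Hypothetical helper: encode text as ISO-8859-1 without ever throwing.
string to_latin1(string text)
{
  return Charset.encoder("iso-8859-1", "?",
                         lambda(string ch) {
                           // Handle the euro sign explicitly; returning 0
                           // defers to the replacement string "?".
                           return ch == "\u20ac" ? "EUR" : 0;
                         })
    ->feed(text)->drain();
}

int main()
{
  // U+20AC is handled by the callback, U+2603 falls back to "?".
  write("%O\n", to_latin1("5 \u20ac \u2603"));
  return 0;
}
```

Omitting both the replacement string and the callback would instead make the encoder throw an error on the first unrepresentable character.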

Method encoder_from_mib

Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)


Returns an encoder for the encoding scheme denoted by the MIB (IANA MIBenum) value mib.

Method normalize

string|zero normalize(string|zero in)


All character set names are normalized through this function before being compared.
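A small sketch of how normalize might be used to compare names supplied by users. The helper same_charset_name is hypothetical; the exact normalized form is not shown here because it is an implementation detail.

```pike
// Hypothetical helper: true if two spellings normalize to the same name.
int(0..1) same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}

int main()
{
  write("%d\n", same_charset_name("iso-8859-1", "ISO_8859_1"));
  write("%d\n", same_charset_name("utf-8", "iso-8859-1"));
  return 0;
}
```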

Method set_decoder

void set_decoder(string name, program decoder)


Adds a custom defined character set decoder. The name is normalized through the use of normalize.

Method set_encoder

void set_encoder(string name, program encoder)


Adds a custom defined character set encoder. The name is normalized through the use of normalize.
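The following is a sketch of registering a custom decoder, under two assumptions that the text above does not spell out: that the registered program is instantiated without arguments when the name is requested, and that it only needs to provide the feed, drain and clear methods documented for Charset.Decoder. The class ReversingDecoder and the charset name "x-reversed" are hypothetical toys.

```pike
// Hypothetical toy decoder: returns the fed data in reverse order.
class ReversingDecoder
{
  string charset = "x-reversed";
  protected string buffer = "";

  this_program feed(string s) { buffer += s; return this; }

  string drain()
  {
    string out = reverse(buffer);
    buffer = "";
    return out;
  }

  this_program clear() { buffer = ""; return this; }
}

int main()
{
  Charset.set_decoder("x-reversed", ReversingDecoder);

  // The name is normalized, so a different spelling should also match.
  write("%O\n", Charset.decoder("X_Reversed")->feed("abc")->drain());
  return 0;
}
```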

Class Charset.CharsetGenericError


Base class for errors thrown by the Charset module.

Inherit Generic

inherit Error.Generic : Generic

Class Charset.DecodeError


Error thrown when decode fails (and no replacement char or replacement callback has been registered).


This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.

Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError

Variable charset

string Charset.DecodeError.charset


The decoding charset, typically as known to Charset.decoder.


Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).

Variable err_pos

int Charset.DecodeError.err_pos


The failing position in err_str.

Variable err_str

string Charset.DecodeError.err_str


The string that failed to be decoded.

Class Charset.Decoder


Virtual base class for charset decoders.


string win1252_to_string( string data )
{
  return Charset.decoder("windows-1252")->feed( data )->drain();
}

Variable charset

string Charset.Decoder.charset


Name of the charset - giving this name to decoder returns an instance of the same class as this object.


This is not necessarily the same name that was actually given to decoder to produce this object.

Method clear

this_program clear()


Clear buffers, and reset all state.


Returns the current object to allow for chaining of calls.

Method drain

string drain()


Get the decoded data, and reset buffers.


Returns the decoded string.

Method feed

this_program feed(string s)


Feeds a string to the decoder.

Parameter s

String to be decoded.


Returns the current object, to allow for chaining of calls.
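Because a decoder keeps state between calls, feed and drain can be used incrementally. The sketch below assumes that the UTF-8 decoder retains an incomplete multi-byte sequence until the next feed; the helper name decode_chunks and the chunk boundaries are illustrative.

```pike
// Hypothetical helper: decode a stream that arrives in arbitrary chunks.
string decode_chunks(string charset_name, array(string) chunks)
{
  Charset.Decoder dec = Charset.decoder(charset_name);
  String.Buffer out = String.Buffer();
  foreach (chunks, string chunk) {
    // drain() returns what could be decoded so far and resets the buffer.
    out->add(dec->feed(chunk)->drain());
  }
  return out->get();
}

int main()
{
  // UTF-8 data split in the middle of two-byte sequences.
  array(string) chunks = ({ "sm\xc3", "\xb6rg\xc3", "\xa5s" });
  write("%O\n", decode_chunks("utf-8", chunks));
  return 0;
}
```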

Class Charset.EncodeError


Error thrown when encode fails (and no replacement char or replacement callback has been registered).


This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.

Inherit CharsetGenericError

inherit CharsetGenericError : CharsetGenericError

Variable charset

string Charset.EncodeError.charset


The encoding charset, typically as known to Charset.encoder.


Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).

Variable err_pos

int Charset.EncodeError.err_pos


The failing position in err_str.

Variable err_str

string Charset.EncodeError.err_str


The string that failed to be encoded.
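Since encode failures may arrive either as untyped error arrays or as error objects, a caller can handle both uniformly with catch and describe_error. This is a minimal sketch; the helper name encode_failure is hypothetical.

```pike
// Hypothetical helper: returns a description of the failure, or 0 on success.
string|zero encode_failure(string charset_name, string text)
{
  mixed err = catch {
    // No replacement string or callback is given, so unrepresentable
    // characters are expected to raise an error.
    Charset.encoder(charset_name)->feed(text)->drain();
  };
  // describe_error() works for error arrays as well as error objects.
  return err ? describe_error(err) : 0;
}

int main()
{
  string|zero msg = encode_failure("iso-8859-1", "5 \u20ac");
  write("%s\n", msg ? "Failed: " + msg : "Encoded without errors.");
  return 0;
}
```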

Class Charset.Encoder


Virtual base class for charset encoders.

Inherit Decoder

inherit Decoder : Decoder


An encoder only differs from a decoder in that it has an extra function.

Variable charset

string Charset.Encoder.charset


Name of the charset - giving this name to encoder returns an instance of the same class as this one.


This is not necessarily the same name that was actually given to encoder to produce this object.

Method set_replacement_callback

this_program set_replacement_callback(function(string:string) rc)


Change the replacement callback function.

Parameter rc

Function that is called to encode characters outside the current character encoding.


Returns the current object to allow for chaining of calls.
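A sketch of changing the callback on an existing encoder: unrepresentable characters are rendered as a code point label. The helper name encode_with_codepoints and the "<U+XXXX>" notation are illustrative assumptions.

```pike
// Hypothetical helper: label unrepresentable characters by code point.
string encode_with_codepoints(string charset_name, string text)
{
  Charset.Encoder enc = Charset.encoder(charset_name);
  enc->set_replacement_callback(lambda(string ch) {
    // ch holds the character that cannot be represented.
    return sprintf("<U+%04X>", ch[0]);
  });
  return enc->feed(text)->drain();
}

int main()
{
  write("%O\n", encode_with_codepoints("iso-8859-1", "5 \u20ac"));
  return 0;
}
```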

Module Charset.Tables

Module Charset.Tables.iso88591


Codec for the ISO-8859-1 character encoding.