guile18/doc/mbapi.texi

   1 \input texinfo
   2 @setfilename mbapi.info
   3 @settitle Multibyte API
   4 @setchapternewpage off
   5
   6 @c Open issues:
   7
   8 @c What's the best way to report errors?  Should functions return a
   9 @c magic value, according to C tradition, or should they signal a
  10 @c Guile exception?
  11
  12 @c
  13
  14
  15 @node Working With Multibyte Strings in C
  16 @chapter Working With Multibyte Strings in C
  17
  18 Guile allows strings to contain characters drawn from a wide variety of
  19 languages, including many Asian, Eastern European, and Middle Eastern
  20 languages, in a uniform and unrestricted way.  The string representation
  21 normally used in C code --- an array of @sc{ASCII} characters --- is not
  22 sufficient for Guile strings, since they may contain characters not
  23 present in @sc{ASCII}.
  24
  25 Instead, Guile uses a very large character set, and encodes each
  26 character as a sequence of one or more bytes.  We call this
  27 variable-width encoding a @dfn{multibyte} encoding.  Guile uses this
  28 single encoding internally for all strings, symbol names, error
  29 messages, etc., and performs appropriate conversions upon input and
  30 output.
  31
  32 The use of this variable-width encoding is almost invisible to Scheme
  33 code.  Strings are still indexed by character number, not by byte
  34 offset; @code{string-length} still returns the length of a string in
  35 characters, not in bytes.  @code{string-ref} and @code{string-set!} are
  36 no longer guaranteed to be constant-time operations, but Guile uses
  37 various strategies to reduce the impact of this change.
  38
  39 However, the encoding is visible via Guile's C interface, which gives
  40 the user direct access to a string's bytes.  This chapter explains how
  41 to work with Guile multibyte text in C code.  Since variable-width
  42 encodings are clumsier to work with than simple fixed-width encodings,
  43 Guile provides a set of standard macros and functions for manipulating
  44 multibyte text to make the job easier.  Furthermore, Guile makes some
  45 promises about the encoding which you can use in writing your own text
  46 processing code.
  47
  48 While we discuss guaranteed properties of Guile's encoding, and provide
  49 functions to operate on its character set, we do not actually specify
  50 either the character set or encoding here.  This is because we expect
  51 both of them to change in the future: currently, Guile uses the same
  52 encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
  53 as well) to use Unicode and UTF-8, with some extensions.  This will make
  54 it more comfortable to use Guile with other systems which use UTF-8,
  55 like the GTK user interface toolkit.
  56
  57 @menu
  58 * Multibyte String Terminology::
  59 * Promised Properties of the Guile Multibyte Encoding::
  60 * Functions for Operating on Multibyte Text::
  61 * Multibyte Text Processing Errors::
  62 * Why Guile Does Not Use a Fixed-Width Encoding::
  63 @end menu
  64
  65
  66 @node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
  67 @section Multibyte String Terminology
  68
  69 In the descriptions which follow, we make the following definitions:
  70 @table @dfn
  71
  72 @item byte
  73 A @dfn{byte} is a number between 0 and 255.  It has no inherent textual
  74 interpretation.  So 65 is a byte, not a character.
  75
  76 @item character
  77 A @dfn{character} is a unit of text.  It has no inherent numeric value.
  78 @samp{A} and @samp{.} are characters, not bytes.  (This is different
  79 from the C language's definition of @dfn{character}; in this chapter, we
  80 will always use a phrase like ``the C language's @code{char} type'' when
  81 that's what we mean.)
  82
  83 @item character set
  84 A @dfn{character set} is an invertible mapping between numbers and a
  85 given set of characters.  @sc{ASCII} is a character set assigning
  86 characters to the numbers 0 through 127.  It maps @samp{A} onto the
  87 number 65, and @samp{.} onto 46.
  88
  89 Note that a character set maps characters onto numbers, @emph{not
  90 necessarily} onto bytes.  For example, the Unicode character set maps
  91 the Greek lower-case @samp{alpha} character onto the number 945, which
  92 is not a byte.
  93
  94 (This is what Internet standards would call a ``coded character set''.)
  95
  96 @item encoding
  97 An encoding maps numbers onto sequences of bytes.  For example, the
  98 UTF-8 encoding, defined in the Unicode Standard, would map the number
  99 945 onto the sequence of bytes @samp{206 177}.  When using the
 100 @sc{ASCII} character set, every number assigned also happens to be a
 101 byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.
 102
 103 (This is what Internet standards would call a ``character encoding
 104 scheme''.)
 105
 106 @end table
 107
 108 Thus, to turn a character into a sequence of bytes, you need a character
 109 set to assign a number to that character, and then an encoding to turn
 110 that number into a sequence of bytes.
 111
 112 Likewise, to interpret a sequence of bytes as a sequence of characters,
 113 you use an encoding to extract a sequence of numbers from the bytes, and
 114 then a character set to turn the numbers into characters.
 115
 116 Errors can occur while carrying out either of these processes.  Under a
 117 particular encoding, a given string of bytes might not correspond to
 118 any number.  For example, the byte sequence @samp{128 128} is not a
 119 valid encoding of any number under UTF-8.
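
To make the second step concrete, here is a minimal sketch in C of the
UTF-8 encoding step used in the examples above.  It is an illustration
only: @code{sketch_utf8_encode} is a hypothetical helper, not part of
Guile, and its argument is the number already assigned to the character
by the Unicode character set.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libguile.  Encode the number N into
   UTF-8 at OUT, which must have room for 4 bytes.  Return the number
   of bytes written, or 0 if N has no UTF-8 encoding.  */
size_t
sketch_utf8_encode (unsigned long n, unsigned char *out)
{
  if (n < 0x80)
    {
      /* The ASCII range is encoded as a single identical byte.  */
      out[0] = (unsigned char) n;
      return 1;
    }
  if (n < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (n >> 6));
      out[1] = (unsigned char) (0x80 | (n & 0x3F));
      return 2;
    }
  if (n < 0x10000)
    {
      if (n >= 0xD800 && n <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (n >> 12));
      out[1] = (unsigned char) (0x80 | ((n >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (n & 0x3F));
      return 3;
    }
  if (n < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (n >> 18));
      out[1] = (unsigned char) (0x80 | ((n >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((n >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (n & 0x3F));
      return 4;
    }
  return 0;                     /* beyond the Unicode range */
}
```

Called with 945, the sketch stores the two bytes 206 and 177, matching
the example in the definition of @dfn{encoding} above.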
 120
 121 Having carefully defined our terminology, we will now abuse it.
 122
 123 We will sometimes use the word @dfn{character} to refer to the number
 124 assigned to a character by a character set, in contexts where it's
 125 obvious we mean a number.
 126
 127 Sometimes there is a close association between a particular encoding and
 128 a particular character set.  Thus, we may sometimes refer to the
 129 character set and encoding together as an @dfn{encoding}.
 130
 131
 132 @node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
 133 @section Promised Properties of the Guile Multibyte Encoding
 134
 135 Internally, Guile uses a single encoding for all text --- symbols,
 136 strings, error messages, etc.  Here we list a number of helpful
 137 properties of Guile's encoding.  It is correct to write code which
 138 assumes these properties; code which uses these assumptions will be
 139 portable to all future versions of Guile, as far as we know.
 140
 141 @b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
 142 the obvious way.}  This means that a standard C string containing only
 143 @sc{ASCII} characters is a valid Guile string (except for the terminator;
 144 Guile strings store the length explicitly, so they can contain null
 145 characters).
 146
 147 @b{The encodings of non-@sc{ASCII} characters use only bytes between 128
 148 and 255.}  That is, when we turn a non-@sc{ASCII} character into a
 149 series of bytes, none of those bytes can ever be mistaken for the
 150 encoding of an @sc{ASCII} character.  This means that you can search a
 151 Guile string for an @sc{ASCII} character using the standard
 152 @code{memchr} library function.  By extension, you can search for an
 153 @sc{ASCII} substring in a Guile string using a traditional substring
 154 search algorithm --- you needn't add special checks to verify encoding
 155 boundaries, etc.
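
As a small illustration of this promise, the following sketch searches
multibyte text for an @sc{ASCII} character with plain @code{memchr}.
The helper name @code{sketch_find_ascii} is hypothetical, and in the
usage note the text is assumed to be UTF-8, which satisfies the same
promise; Guile's actual encoding is not specified here.

```c
#include <string.h>

/* Hypothetical helper, not part of libguile.  Return the byte offset of
   the first occurrence of the ASCII character C in the LEN bytes at
   TEXT, or -1 if it does not occur.  No encoding-boundary checks are
   needed, because no byte of a non-ASCII character can equal C.  */
long
sketch_find_ascii (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit = memchr (text, c, len);

  return hit ? (long) (hit - text) : -1;
}
```

For the four bytes @samp{206 177 61 49} (a Greek alpha followed by
@samp{=1} in UTF-8), searching for @samp{=} yields byte offset 2.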
 156
 157 @b{No character encoding is a subsequence of any other character
 158 encoding.}  (This is just a stronger version of the previous promise.)
 159 This means that you can search for occurrences of one Guile string
 160 within another Guile string just as if they were raw byte strings.  You
 161 can use the stock @code{memmem} function (provided on GNU systems, at
 162 least) for such searches.  If you don't need the ability to represent
 163 null characters in your text, you can still use null-termination for
 164 strings, and use the traditional string-handling functions like
 165 @code{strlen}, @code{strstr}, and @code{strcat}.
 166
 167 @b{You can always determine the full length of a character's encoding
 168 from its first byte.}  Guile provides the macro @code{scm_mb_len} which
 169 computes the encoding's length from its first byte.  Given the first
 170 rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
 171 @var{b} <= 127}, returns 1.
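
To show what such a computation can look like, here is a sketch for
UTF-8, the encoding we hope to adopt.  It is not the definition of
@code{scm_mb_len}: @code{sketch_utf8_len} is a hypothetical stand-in,
and unlike @code{scm_mb_len} it returns 0 for a byte that cannot start
an encoding instead of leaving the behavior undefined.

```c
/* Hypothetical stand-in for a first-byte length macro, assuming UTF-8.
   Return the length in bytes of the encoding that starts with B, or 0
   if B cannot be the first byte of any encoding.  */
int
sketch_utf8_len (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* ASCII: always a single byte */
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;                     /* continuation or unused byte */
}
```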
 172
 173 @b{Given an arbitrary byte position in a Guile string, you can always
 174 find the beginning and end of the character containing that byte without
 175 scanning too far in either direction.}  This means that, if you are sure
 176 a byte sequence is a valid encoding of a character sequence, you can
 177 find character boundaries without keeping track of the beginning and
 178 ending of the overall string.  This promise relies on the fact that, in
 179 addition to storing the string's length explicitly, Guile always either
 180 terminates the string's storage with a zero byte, or shares it with
 181 another string which is terminated this way.
 182
 183
 184 @node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
 185 @section Functions for Operating on Multibyte Text
 186
 187 Guile provides a variety of functions, variables, and types for working
 188 with multibyte text.
 189
 190 @menu
 191 * Basic Multibyte Character Processing::
 192 * Finding Character Encoding Boundaries::
 193 * Multibyte String Functions::
 194 * Exchanging Guile Text With the Outside World in C::
 195 * Implementing Your Own Text Conversions::
 196 @end menu
 197
 198
 199 @node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
 200 @subsection Basic Multibyte Character Processing
 201
 202 Here are the essential types and functions for working with Guile text.
 203 Guile uses the C type @code{unsigned char *} to refer to text encoded
 204 with Guile's encoding.
 205
 206 Note that any operation marked here as a ``Libguile Macro'' might
 207 evaluate its argument multiple times.
 208
 209 @deftp {Libguile Type} scm_char_t
 210 This is a signed integral type large enough to hold any character in
 211 Guile's character set.  All character numbers are non-negative.
 212 @end deftp
 213
 214 @deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
 215 Return the character whose encoding starts at @var{p}.  If @var{p} does
 216 not point at a valid character encoding, the behavior is undefined.
 217 @end deftypefn
 218
 219 @deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
 220 Place the encoded form of the Guile character @var{c} at @var{p}, and
 221 return its length in bytes.  If @var{c} is not a Guile character, the
 222 behavior is undefined.
 223 @end deftypefn
 224
 225 @deftypevr {Libguile Constant} int scm_mb_max_len
 226 The maximum length of any character's encoding, in bytes.  You may
 227 assume this is relatively small --- less than a dozen or so.
 228 @end deftypevr
 229
 230 @deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
 231 If @var{b} is the first byte of a character's encoding, return the full
 232 length of the character's encoding, in bytes.  If @var{b} is not a valid
 233 leading byte, the behavior is undefined.
 234 @end deftypefn
 235
 236 @deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
 237 Return the length of the encoding of the character @var{c}, in bytes.
 238 If @var{c} is not a valid Guile character, the behavior is undefined.
 239 @end deftypefn
 240
 241 @deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
 242 @deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
 243 @deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
 244 @deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
 245 These are functions identical to the corresponding macros.  You can use
 246 them in situations where the overhead of a function call is acceptable,
 247 and the cleaner semantics of function application are desirable.
 248 @end deftypefn
 249
 250
 251 @node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
 252 @subsection Finding Character Encoding Boundaries
 253
 254 These are functions for finding the boundaries between characters in
 255 multibyte text.
 256
 257 Note that any operation marked here as a ``Libguile Macro'' might
 258 evaluate its argument multiple times, unless the definition promises
 259 otherwise.
 260
 261 @deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
 262 Return non-zero iff @var{p} points to the start of a character in
 263 multibyte text.
 264
 265 This macro will evaluate its argument only once.
 266 @end deftypefn
 267
 268 @deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
 269 ``Round'' @var{p} to the previous character boundary.  That is, if
 270 @var{p} points to the middle of the encoding of a Guile character,
 271 return a pointer to the first byte of the encoding.  If @var{p} points
 272 to the start of the encoding of a Guile character, return @var{p}
 273 unchanged.
 274 @end deftypefn
 275
 276 @deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
 277 ``Round'' @var{p} to the next character boundary.  That is, if @var{p}
 278 points to the middle of the encoding of a Guile character, return a
 279 pointer to the first byte of the encoding of the next character.  If
 280 @var{p} points to the start of the encoding of a Guile character, return
 281 @var{p} unchanged.
 282 @end deftypefn
 283
 284 Note that it is usually not friendly for functions to silently correct
 285 byte offsets that point into the middle of a character's encoding.  Such
 286 offsets almost always indicate a programming error, and they should be
 287 reported as early as possible.  So, when you write code which operates
 288 on multibyte text, you should not use functions like these to ``clean
 289 up'' byte offsets which the originator believes to be correct; instead,
 290 your code should signal a @code{text:not-char-boundary} error as soon as
 291 it detects an invalid offset.  @xref{Multibyte Text Processing Errors}.
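
The difference between rounding an offset and rejecting it can be
sketched as follows.  The code assumes UTF-8 text, in which every byte
from 128 to 191 is a continuation byte; all three function names are
hypothetical, and the -1 return value merely stands in for signalling a
@code{text:not-char-boundary} error.

```c
#include <stddef.h>

/* Assuming UTF-8: a byte starts a character unless it has the form
   10xxxxxx.  */
int
sketch_boundary_p (const unsigned char *p)
{
  return (*p & 0xC0) != 0x80;
}

/* Round P down to the first byte of the character containing it.
   P must point into valid text, so the loop stops after at most three
   steps.  */
const unsigned char *
sketch_floor (const unsigned char *p)
{
  while (!sketch_boundary_p (p))
    p--;
  return p;
}

/* Validate OFFSET into the LEN bytes at TEXT instead of rounding it.
   Return 0 if OFFSET is a character boundary, or -1 if it is out of
   range or falls inside a character's encoding.  */
int
sketch_check_offset (const unsigned char *text, size_t len, size_t offset)
{
  if (offset > len)
    return -1;
  if (offset == len)
    return 0;                   /* the end of the text is a boundary */
  return sketch_boundary_p (text + offset) ? 0 : -1;
}
```

A function that receives a byte offset from its caller would use
something like @code{sketch_check_offset} and report the failure, and
would reserve @code{sketch_floor}-style rounding for offsets it computed
itself.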
 292
 293
 294 @node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
 295 @subsection Multibyte String Functions
 296
 297 These functions allow you to operate on multibyte strings: sequences of
 298 character encodings.
 299
 300 @deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
 301 Return the number of Guile characters encoded by the @var{len} bytes at
 302 @var{p}.
 303
 304 If the sequence contains any invalid character encodings, or ends with
 305 an incomplete character encoding, signal a @code{text:bad-encoding}
 306 error.
 307 @end deftypefn
 308
 309 @deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
 310 Return the character whose encoding starts at @code{*@var{pp}}, and
 311 advance @code{*@var{pp}} to the start of the next character.  Return -1
 312 if @code{*@var{pp}} does not point to a valid character encoding.
 313 @end deftypefn
 314
 315 @deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
 316 If @var{p} points to the middle of the encoding of a Guile character,
 317 return a pointer to the first byte of the encoding.  If @var{p} points
 318 to the start of the encoding of a Guile character, return the start of
 319 the previous character's encoding.
 320
 321 This is like @code{scm_mb_floor}, but the returned pointer will always
 322 be before @var{p}.  If you use this function to drive an iteration, it
 323 guarantees backward progress.
 324 @end deftypefn
 325
 326 @deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
 327 If @var{p} points to the encoding of a Guile character, return a pointer
 328 to the first byte of the encoding of the next character.
 329
 330 This is like @code{scm_mb_ceiling}, but the returned pointer will always
 331 be after @var{p}.  If you use this function to drive an iteration, it
 332 guarantees forward progress.
 333 @end deftypefn
 334
 335 @deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
 336 Assuming that the @var{len} bytes starting at @var{p} are a
 337 concatenation of valid character encodings, return a pointer to the
 338 start of the @var{i}'th character encoding in the sequence.
 339
 340 This function scans the sequence from the beginning to find the
 341 @var{i}'th character, and will generally require time proportional to
 342 the distance from @var{p} to the returned address.
 343
 344 If the sequence contains any invalid character encodings, or ends with
 345 an incomplete character encoding, signal a @code{text:bad-encoding}
 346 error.
 347 @end deftypefn
 348
 349 It is common to process the characters in a string from left to right.
 350 However, if you fetch each character using @code{scm_mb_index}, each
 351 call will scan the text from the beginning, so your loop will require
 352 time proportional to at least the square of the length of the text.  To
 353 avoid this poor performance, you can use an @code{scm_mb_cache}
 354 structure and the @code{scm_mb_index_cached} macro.
 355
 356 @deftp {Libguile Type} {struct scm_mb_cache}
 357 This structure holds information that allows a string scanning operation
 358 to use the results from a previous scan of the string.  It has the
 359 following members:
 360 @table @code
 361
 362 @item character
 363 An index, in characters, into the string.
 364
 365 @item byte
 366 The index, in bytes, of the start of that character.
 367
 368 @end table
 369
 370 In other words, @code{byte} is the byte offset of the
 371 @code{character}'th character of the string.  Note that if @code{byte}
 372 and @code{character} are equal, then all characters before that point
 373 must have encodings exactly one byte long, and the string can be indexed
 374 normally.
 375
 376 All elements of a @code{struct scm_mb_cache} structure should be
 377 initialized to zero before its first use, and whenever the string's text
 378 changes.
 379 @end deftp
 380
 381 @deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
 382 @deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
 383 This macro and this function are identical to @code{scm_mb_index},
 384 except that they may consult and update @code{*@var{cache}} in order to avoid
 385 scanning the string from the beginning.  @code{scm_mb_index_cached} is a
 386 macro, so it may have less overhead than
 387 @code{scm_mb_index_cached_func}, but it may evaluate its arguments more
 388 than once.
 389
 390 Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
 391 can scan a string from left to right, or from right to left, in time
 392 proportional to the length of the string.  As long as each character
 393 fetched is less than some constant distance before or after the previous
 394 character fetched with @var{cache}, each access will require constant
 395 time.
 396 @end deftypefn
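
The idea behind the cache can be sketched in a few lines.  This is not
the implementation of @code{scm_mb_index_cached}: the names
@code{struct sketch_cache} and @code{sketch_index_cached} are
hypothetical, the text is assumed to be valid UTF-8, and the sketch
returns a byte offset rather than a pointer.  It always walks from the
cached position, whereas a real implementation might restart from the
beginning of the string when that is closer.

```c
#include <stddef.h>

/* Mirrors the two members described above.  */
struct sketch_cache
{
  size_t character;             /* an index, in characters */
  size_t byte;                  /* byte offset of that character */
};

/* Assuming UTF-8: continuation bytes have the form 10xxxxxx.  */
static int
sketch_is_start (unsigned char b)
{
  return (b & 0xC0) != 0x80;
}

/* Return the byte offset of the I'th character in the LEN bytes of
   valid text at P, walking forward or backward from the position
   recorded in CACHE, and update CACHE.  Return LEN if the text has
   I or fewer characters.  */
size_t
sketch_index_cached (const unsigned char *p, size_t len, size_t i,
                     struct sketch_cache *cache)
{
  size_t c = cache->character;
  size_t b = cache->byte;

  while (c < i && b < len)      /* walk forward, one character at a time */
    {
      b++;
      while (b < len && !sketch_is_start (p[b]))
        b++;
      c++;
    }
  while (c > i)                 /* walk backward, one character at a time */
    {
      b--;
      while (!sketch_is_start (p[b]))
        b--;
      c--;
    }

  cache->character = c;
  cache->byte = b;
  return b;
}
```

With a zero-initialized cache, fetching characters 0, 1, 2, and so on
moves the cached position by one character per call, so a full
left-to-right scan takes time proportional to the length of the text.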
 397
 398 Guile also provides functions to convert between an encoded sequence of
 399 characters, and an array of @code{scm_char_t} objects.
 400
 401 @deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
 402 Convert the variable-width text in the @var{len} bytes at @var{p}
 403 to an array of @code{scm_char_t} values.  Return a pointer to the array,
 404 and set @code{*@var{result_len}} to the number of elements it contains.
 405 The returned array is allocated with @code{malloc}, and it is the
 406 caller's responsibility to free it.
 407
 408 If the text is not a sequence of valid character encodings, this
 409 function will signal a @code{text:bad-encoding} error.
 410 @end deftypefn
 411
 412 @deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
 413 Convert the array of @code{scm_char_t} values to a sequence of
 414 variable-width character encodings.  Return a pointer to the array of
 415 bytes, and set @code{*@var{result_len}} to its length, in bytes.
 416
 417 The returned byte sequence is terminated with a zero byte, which is not
 418 counted in the length returned in @code{*@var{result_len}}.
 419
 420 The returned byte sequence is allocated with @code{malloc}; it is the
 421 caller's responsibility to free it.
 422
 423 If @var{fixed} contains a value that is not a valid Guile character,
 424 this function will signal a @code{text:bad-encoding} error.
 425 @end deftypefn
 426
 427
 428 @node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
 429 @subsection Exchanging Guile Text With the Outside World in C
 430
 431 [[This is kind of a heavy-weight model, given that one end of the
 432 conversion is always going to be the Guile encoding.  Any way to shorten
 433 things a bit?]]
 434
 435 Guile provides functions for converting between Guile's internal text
 436 representation and encodings popular in the outside world.  These
 437 functions are closely modeled after the @code{iconv} functions available
 438 on some systems.
 439
 440 To convert text between two encodings, you should first call
 441 @code{scm_mb_iconv_open} to indicate the source and destination
 442 encodings; this function returns a context object which records the
 443 conversion to perform.
 444
 445 Then, you should call @code{scm_mb_iconv} to actually convert the text.
 446 This function expects input and output buffers, and a pointer to the
 447 context you got from @code{scm_mb_iconv_open}.  You don't need to pass
 448 all your input to @code{scm_mb_iconv} at once; you can invoke it on
 449 successive blocks of input (as you read it from a file, say), and it
 450 will convert as much as it can each time, indicating when you should
 451 grow your output buffer.
 452
 453 An encoding may be @dfn{stateless}, or @dfn{stateful}.  In most
 454 encodings, a contiguous group of bytes from the sequence completely
 455 specifies a particular character; these are stateless encodings.
 456 However, some encodings require you to look back an unbounded number of
 457 bytes in the stream to assign a meaning to a particular byte sequence;
 458 such encodings are stateful.
 459
 460 For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
 461 byte sequence @samp{27 36 66} indicates that subsequent bytes should be
 462 taken in pairs and interpreted as characters from the JIS-0208 character
 463 set.  An arbitrary number of byte pairs may follow this sequence.  The
 464 byte sequence @samp{27 40 66} indicates that subsequent bytes should be
 465 interpreted as @sc{ASCII}.  In this encoding, you cannot tell whether a
 466 given byte is an @sc{ASCII} character without looking back an arbitrary
 467 distance for the most recent escape sequence, so it is a stateful
 468 encoding.
 469
 470 In Guile, if a conversion involves a stateful encoding, the context
 471 object carries any necessary state.  Thus, you can have many independent
 472 conversions to or from stateful encodings taking place simultaneously,
 473 as long as each data stream uses its own context object for the
 474 conversion.
 475
 476 @deftp {Libguile Type} {struct scm_mb_iconv}
 477 This is the type for context objects, which represent the encodings and
 478 current state of an ongoing text conversion.  A @code{struct
 479 scm_mb_iconv} records the source and destination encodings, and keeps
 480 track of any information needed to handle stateful encodings.
 481 @end deftp
 482
 483 @deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
 484 Return a pointer to a new @code{struct scm_mb_iconv} context object,
 485 ready to convert from the encoding named @var{fromcode} to the encoding
 486 named @var{tocode}.  For stateful encodings, the context object is in
 487 some appropriate initial state, ready for use with the
 488 @code{scm_mb_iconv} function.
 489
 490 When you are done using a context object, you may call
 491 @code{scm_mb_iconv_close} to free it.
 492
 493 If either @var{tocode} or @var{fromcode} is not the name of a known
 494 encoding, this function will signal the @code{text:unknown-conversion}
 495 error, described below.
 496
 497 @c Try to use names here from the IANA list:
 498 @c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
 499 Guile supports at least these encodings:
 500 @table @samp
 501
 502 @item US-ASCII
 503 @sc{US-ASCII}, in the standard one-character-per-byte encoding.
 504
 505 @item ISO-8859-1
 506 The usual character set for Western European languages, in its usual
 507 one-character-per-byte encoding.
 508
 509 @item Guile-MB
 510 Guile's current internal multibyte encoding.  The actual encoding this
 511 name refers to will change from one version of Guile to the next.  You
 512 should use this when converting data between external sources and the
 513 encoding used by Guile objects.
 514
 515 You should @emph{not} use this as the encoding for data presented to the
 516 outside world, for two reasons.  1) Its meaning will change over time,
 517 so data written using the @samp{Guile-MB} encoding with one version of
 518 Guile might not be readable with the @samp{Guile-MB} encoding in another
 519 version of Guile.  2) It currently corresponds to @samp{Emacs-Mule},
 520 which was invented for Emacs's internal use, and was never intended to serve
 521 as an exchange medium.
 522
 523 @item Guile-Wide
 524 Guile's character set, as an array of @code{scm_char_t} values.
 525
 526 Note that this encoding is even less suitable for public use than
 527 @samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
 528 size and endianness the host system uses for @code{scm_char_t}.  Using
 529 this encoding is very much like calling the
 530 @code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
 531 functions, except that @code{scm_mb_iconv} gives you more control over
 532 buffer allocation and management.
 533
 534 @item Emacs-Mule
 535 This is the variable-length encoding for multi-lingual text used by GNU
 536 Emacs, at least through version 20.4.  You probably should not use this
 537 encoding, as it is designed only for Emacs's internal use.  However, we
 538 provide it here because it's trivial to support, and some people
 539 probably do have @samp{emacs-mule}-format files lying around.
 540
 541 @end table
 542
 543 (At the moment, this list doesn't include any character sets suitable for
 544 external use that can actually handle multilingual data; this is
 545 unfortunate, as it encourages users to write data in Emacs-Mule format,
 546 which nobody but Emacs and Guile understands.  We hope to add support
 547 for Unicode in UTF-8 soon, which should solve this problem.)
 548
 549 Case is not significant in encoding names.
 550
 551 You can define your own conversions; see @ref{Implementing Your Own Text
 552 Conversions}.
 553 @end deftypefn
 554
 555 @deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
 556 Return a non-zero value if Guile supports the encoding named @var{encoding}.
 557 @end deftypefn
 558
 559 @deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
 560 Convert a sequence of characters from one encoding to another.  The
 561 argument @var{context} specifies the encodings to use for the input and
 562 output, and carries state for stateful encodings; use
 563 @code{scm_mb_iconv_open} to create a @var{context} object for a
 564 particular conversion.
 565
 566 Upon entry to the function, @code{*@var{inbuf}} should point to the
 567 input buffer, and @code{*@var{inbytesleft}} should hold the number of
 568 input bytes present in the buffer; @code{*@var{outbuf}} should point to
 569 the output buffer, and @code{*@var{outbytesleft}} should hold the number
 570 of bytes available to hold the conversion results in that buffer.
 571
 572 Upon exit from the function, @code{*@var{inbuf}} points to the first
 573 unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
 574 of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
 575 the last output byte, and @code{*@var{outbytesleft}} holds the number of
 576 bytes left unused in the output buffer.
 577
 578 For stateful encodings, @var{context} carries encoding state from one
 579 call to @code{scm_mb_iconv} to the next.  Thus, successive calls to
 580 @code{scm_mb_iconv} which use the same context object can convert a
 581 stream of data one chunk at a time.
 582
 583 If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
 584 taken as a request to reset the states of the input and the output
 585 encodings.  If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
 586 non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
 587 buffer to put the output encoding in its initial state.  If the output
 588 buffer is not large enough to hold this byte sequence,
 589 @code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
 590 the shift states of @var{context}'s input and output encodings
 591 unchanged.
 592
 593 The @code{scm_mb_iconv} function always consumes only complete
 594 characters or shift sequences from the input buffer, and the output
 595 buffer always contains a sequence of complete characters or escape
 596 sequences.
 597
 598 If the input sequence contains characters which are not expressible in
 599 the output encoding, @code{scm_mb_iconv} converts them in an
 600 implementation-defined way.  It may simply delete such characters.
 601
 602 Some encodings use byte sequences which do not correspond to any textual
 603 character.  For example, the escape sequence of a stateful encoding has
 604 no textual meaning.  When converting from such an encoding, a call to
 605 @code{scm_mb_iconv} might consume input but produce no output, since the
 606 input sequence might contain only escape sequences.
 607
 608 Normally, @code{scm_mb_iconv} returns the number of input characters it
 609 could not convert perfectly to the output encoding.  However, it may
 610 return one of the @code{scm_mb_iconv_} codes described below, to
 611 indicate an error.  All of these codes are negative values.
 612
 613 If the input sequence contains an invalid character encoding, conversion
 614 stops before the invalid input character, and @code{scm_mb_iconv}
 615 returns the constant value @code{scm_mb_iconv_bad_encoding}.
 616
 617 If the input sequence ends with an incomplete character encoding,
 618 @code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
 619 return the constant value @code{scm_mb_iconv_incomplete_encoding}.  This
 620 is not necessarily an error, if you expect to call @code{scm_mb_iconv}
 621 again with more data which might contain the rest of the encoding
 622 fragment.
 623
 624 If the output buffer does not contain enough room to hold the converted
 625 form of the complete input text, @code{scm_mb_iconv} converts as much as
 626 it can, changes the input and output pointers to reflect the amount of
 627 text successfully converted, and then returns
 628 @code{scm_mb_iconv_too_big}.
 629 @end deftypefn
 630
 631 Here are the status codes that might be returned by @code{scm_mb_iconv}.
 632 They are all negative integers.
 633 @table @code
 634
 635 @item scm_mb_iconv_too_big
 636 The conversion needs more room in the output buffer.  Some characters
 637 may have been consumed from the input buffer, and some characters may
 638 have been placed in the available space in the output buffer.
 639
 640 @item scm_mb_iconv_bad_encoding
 641 @code{scm_mb_iconv} encountered an invalid character encoding in the
 642 input buffer.  Conversion stopped before the invalid character, so there
 643 may be some characters consumed from the input buffer, and some
 644 converted text in the output buffer.
 645
 646 @item scm_mb_iconv_incomplete_encoding
 647 The input buffer ends with an incomplete character encoding.  The
 648 incomplete encoding is left in the input buffer, unconsumed.  This is
 649 not necessarily an error, if you expect to call @code{scm_mb_iconv}
 650 again with more data which might contain the rest of the incomplete
 651 encoding.
 652
 653 @end table
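
The status codes above suggest a retry loop on the caller's side.  As a
rough sketch, and only by analogy, the following uses POSIX
@code{iconv}, which reports similar conditions through @code{errno}:
@code{E2BIG} plays the role of @code{scm_mb_iconv_too_big},
@code{EILSEQ} that of @code{scm_mb_iconv_bad_encoding}, and
@code{EINVAL} that of @code{scm_mb_iconv_incomplete_encoding}.  The
helper name @code{latin1_to_utf8} and the initial buffer size are
illustrative assumptions, not part of Guile.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Convert LEN bytes of ISO-8859-1 text at IN to a freshly allocated,
   NUL-terminated UTF-8 string.  Returns NULL on failure.  POSIX iconv
   is used here as a stand-in for scm_mb_iconv.  */
char *
latin1_to_utf8 (const char *in, size_t len)
{
  iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    return NULL;

  size_t cap = 4;               /* deliberately small, to force regrowth */
  char *out = malloc (cap + 1); /* one spare byte for the terminator */
  char *inp = (char *) in;      /* iconv does not modify the input text */
  size_t inleft = len;
  size_t used = 0;

  while (out != NULL && inleft > 0)
    {
      char *outp = out + used;
      size_t outleft = cap - used;
      size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
      used = cap - outleft;     /* keep whatever was converted so far */

      if (r != (size_t) -1)
        break;                  /* all input was consumed */

      if (errno == E2BIG)
        {
          /* Analogous to scm_mb_iconv_too_big: grow and continue.  */
          cap *= 2;
          char *bigger = realloc (out, cap + 1);
          if (bigger == NULL)
            free (out);
          out = bigger;
        }
      else
        {
          /* EILSEQ (bad encoding) or EINVAL (incomplete encoding).  */
          free (out);
          out = NULL;
        }
    }

  iconv_close (cd);
  if (out != NULL)
    out[used] = '\0';
  return out;
}
```

A streaming caller would treat the incomplete-encoding case differently:
rather than giving up, it would keep the unconsumed bytes and call again
once more input has arrived.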
 654
 655
 656 Finally, Guile provides a function for destroying conversion contexts.
 657
 658 @deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
 659 Deallocate the conversion context object @var{context}, and all other
 660 resources allocated by the call to @code{scm_mb_iconv_open} which
 661 returned @var{context}.
 662 @end deftypefn
 663
 664
 665 @node Implementing Your Own Text Conversions,  , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
 666 @subsection Implementing Your Own Text Conversions
 667
 668 [[note that conversions to and from Guile must produce streams
 669 containing only valid character encodings, or else Guile will crash]]
 670
 671 This section describes the interface for adding your own encoding
 672 conversions for use with @code{scm_mb_iconv}.  The interface here is
 673 borrowed from the GNOME Project's @file{libunicode} library.
 674
 675 Guile's @code{scm_mb_iconv} function works by converting the input text
 676 to a stream of @code{scm_char_t} characters, and then converting
 677 those characters to the desired output encoding.  This makes it easy
 678 for Guile to choose the appropriate conversion back ends for an
 679 arbitrary pair of input and output encodings, but it also means that the
 680 accuracy and quality of the conversions depends on the fidelity of
 681 Guile's internal character set to the source and destination encodings.
 682 Since @code{scm_mb_iconv} will be used almost exclusively for converting
 683 to and from Guile's internal character set, this shouldn't be a problem.
 684
 685 To add support for a particular encoding to Guile, you must provide one
 686 function (called the @dfn{read} function) which converts from your
 687 encoding to an array of @code{scm_char_t}'s, and another function
 688 (called the @dfn{write} function) to convert from an array of
 689 @code{scm_char_t}'s back into your encoding.  To convert from some
 690 encoding @var{a} to some other encoding @var{b}, Guile pairs up
 691 @var{a}'s read function with @var{b}'s write function.  Each call to
 692 @code{scm_mb_iconv} passes text in encoding @var{a} through the read
 693 function, to produce an array of @code{scm_char_t}'s, and then passes
 694 that array to the write function, to produce text in encoding @var{b}.
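
The pairing of a read function with a write function can be sketched as
a small driver.  This is not Guile's implementation: the name
@code{pivot_convert}, its return values, the toy @code{demo_read} and
@code{demo_write} functions, and the representation of
@code{scm_char_t} are all assumptions made so the example compiles on
its own.  The function types follow the @code{read} and @code{write}
members of @code{struct scm_mb_encoding}.  The driver converts one
character at a time and backs up the input pointer when the output
buffer is full, which is only safe for stateless read functions.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles alone; the representation of
   scm_char_t is an assumption.  */
typedef unsigned long scm_char_t;
enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

typedef enum scm_mb_read_result (*read_fn) (void *, const char **, size_t *,
                                            scm_char_t **, size_t *);
typedef enum scm_mb_write_result (*write_fn) (void *, scm_char_t **, size_t *,
                                              char **, size_t *);

/* Toy read function: every byte is one character; reads one per call.  */
enum scm_mb_read_result
demo_read (void *cookie, const char **in, size_t *inleft,
           scm_char_t **out, size_t *outleft)
{
  (void) cookie;
  if (*inleft == 0 || *outleft == 0)
    return scm_mb_read_incomplete;
  *(*out)++ = (unsigned char) *(*in)++;
  --*inleft;
  --*outleft;
  return scm_mb_read_ok;
}

/* Toy write function: emits the low eight bits of each character.  */
enum scm_mb_write_result
demo_write (void *cookie, scm_char_t **in, size_t *inleft,
            char **out, size_t *outleft)
{
  (void) cookie;
  while (*inleft > 0)
    {
      if (*outleft == 0)
        return scm_mb_write_too_big;
      *(*out)++ = (char) (*(*in)++ & 0xff);
      --*inleft;
      --*outleft;
    }
  return scm_mb_write_ok;
}

/* Hypothetical driver.  Returns 0 when all input was converted, -1 when
   the output buffer is full, -2 on invalid or incomplete input.  */
int
pivot_convert (read_fn rd, void *rd_cookie, write_fn wr, void *wr_cookie,
               const char **inbuf, size_t *inbytesleft,
               char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      const char *saved_in = *inbuf;
      size_t saved_left = *inbytesleft;
      scm_char_t c;
      scm_char_t *cp = &c;
      size_t room = 1;

      if (rd (rd_cookie, inbuf, inbytesleft, &cp, &room) != scm_mb_read_ok)
        return -2;

      size_t produced = 1 - room; /* 0 if only an escape sequence was read */
      cp = &c;
      if (wr (wr_cookie, &cp, &produced, outbuf, outbytesleft)
          == scm_mb_write_too_big)
        {
          /* Un-read the character that did not fit.  */
          *inbuf = saved_in;
          *inbytesleft = saved_left;
          return -1;
        }
    }
  return 0;
}
```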
 695
 696 For stateful encodings, a read or write function can hang its own data
 697 structures off the conversion object, and provide its own functions to
 698 allocate and destroy them; this allows read and write functions to
 699 maintain whatever state they like.
 700
 701 The Guile conversion back end represents each available encoding with a
 702 @code{struct scm_mb_encoding} object.
 703
 704 @deftp {Libguile Type} {struct scm_mb_encoding}
 705 This data structure describes an encoding.  It has the following
 706 members:
 707
 708 @table @code
 709
 710 @item char **names
 711 An array of strings, giving the various names for this encoding.  The
 712 array should be terminated by a zero pointer.  Case is not significant
 713 in encoding names.
 714
 715 The @code{scm_mb_iconv_open} function searches the list of registered
 716 encodings for an encoding whose @code{names} array matches its
 717 @var{tocode} or @var{fromcode} argument.
 718
 719 @item int (*init) (void **@var{cookie})
 720 An initialization function for the encoding's private data.
 721 @code{scm_mb_iconv_open} will call this function, passing it the address
 722 of the cookie for this encoding in this context.  (We explain cookies
 723 below.)  There is no way for the @code{init} function to tell whether
 724 the encoding will be used for reading or writing.
 725
 726 Note that @code{init} receives a @emph{pointer} to the cookie, not the
 727 cookie itself.  Because the type of @var{cookie} is @code{void **}, the
 728 C compiler will not check it as carefully as it would other types.
 729
 730 The @code{init} member may be zero, indicating that no initialization is
 731 necessary for this encoding.
 732
 733 @item int (*destroy) (void **@var{cookie})
 734 A deallocation function for the encoding's private data.
 735 @code{scm_mb_iconv_close} calls this function, passing it the address of
 736 the cookie for this encoding in this context.  The @code{destroy}
 737 function should free any data the @code{init} function allocated.
 738
 739 Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
 740 cookie itself.  Because the type of @var{cookie} is @code{void **}, the
 741 C compiler will not check it as carefully as it would other types.
 742
 743 The @code{destroy} member may be zero, indicating that this encoding
 744 doesn't need to perform any special action to destroy its local data.
 745
 746 @item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
 747 Put the encoding into its initial shift state.  Guile calls this
 748 function whether the encoding is being used for input or output, so this
 749 should take appropriate steps for both directions.  If @var{outbuf} and
 750 @var{outbytesleft} are valid, the reset function should emit an escape
 751 sequence to reset the output stream to its initial state; @var{outbuf}
 752 and @var{outbytesleft} should be handled just as for
 753 @code{scm_mb_iconv}.
 754
 755 This function can return an @code{scm_mb_iconv_} error code
 756 (@pxref{Exchanging Guile Text With the Outside World in C}).  If it
 757 returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
 758 state must be left unchanged.
 759
 760 Note that @code{reset} receives the cookie's value itself, not a pointer
 761 to the cookie, as the @code{init} and @code{destroy} functions do.
 762
 763 The @code{reset} member may be zero, indicating that this encoding
 764 doesn't use a shift state.
 765
@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
 767 Read some bytes and convert into an array of Guile characters.  This is
 768 the encoding's read function.
 769
 770 On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
 771 be converted, and *@var{outcharsleft} characters available at
 772 *@var{outbuf} to hold the results.
 773
 774 On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
 775 still not consumed.  *@var{outcharsleft} and *@var{outbuf} indicate the
 776 output buffer space still not filled.  (By exclusion, these indicate
 777 which input bytes were consumed, and which output characters were
 778 produced.)
 779
 780 Return one of the @code{enum scm_mb_read_result} values, described below.
 781
 782 Note that @code{read} receives the cookie's value itself, not a pointer
 783 to the cookie, as the @code{init} and @code{destroy} functions do.
 784
@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
 786 Convert an array of Guile characters to output bytes.  This is
 787 the encoding's write function.
 788
 789 On entry, there are *@var{incharsleft} Guile characters available at
 790 *@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
 791 *@var{outbuf}.
 792
 793 On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of
 794 Guile characters left unconverted (because there was insufficient room
 795 in the output buffer to hold their converted forms), and
 796 *@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
 797 output buffer.
 798
Return one of the @code{enum scm_mb_write_result} values, described below.
 800
 801 Note that @code{write} receives the cookie's value itself, not a pointer
 802 to the cookie, as the @code{init} and @code{destroy} functions do.
 803
 804 @item struct scm_mb_encoding *next
 805 This is used by Guile to maintain a linked list of encodings.  It is
 806 filled in when you call @code{scm_mb_register_encoding} to add your
 807 encoding to the list.
 808
 809 @end table
 810 @end deftp
 811
 812 Here is the enumerated type for the values an encoding's read function
 813 can return:
 814
 815 @deftp {Libguile Type} {enum scm_mb_read_result}
 816 This type represents the result of a call to an encoding's read
 817 function.  It has the following values:
 818
 819 @table @code
 820
 821 @item scm_mb_read_ok
 822 The read function consumed at least one byte of input.
 823
 824 @item scm_mb_read_incomplete
 825 The data present in the input buffer does not contain a complete
 826 character encoding.  No input was consumed, and no characters were
 827 produced as output.  This is not necessarily an error status, if there
 828 is more data to pass through.
 829
 830 @item scm_mb_read_error
 831 The input contains an invalid character encoding.
 832
 833 @end table
 834 @end deftp
 835
 836 Here is the enumerated type for the values an encoding's write function
 837 can return:
 838
 839 @deftp {Libguile Type} {enum scm_mb_write_result}
 840 This type represents the result of a call to an encoding's write
 841 function.  It has the following values:
 842
 843 @table @code
 844
 845 @item scm_mb_write_ok
 846 The write function was able to convert all the characters in @var{inbuf}
 847 successfully.
 848
 849 @item scm_mb_write_too_big
 850 The write function filled the output buffer, but there are still
 851 characters in @var{inbuf} left unconsumed; @var{inbuf} and
 852 @var{incharsleft} indicate the unconsumed portion of the input buffer.
 853
 854 @end table
 855 @end deftp
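
To make the read and write contracts concrete, here is a minimal sketch
of a stateless seven-bit @sc{ASCII} encoding.  The type and enumeration
definitions are local stand-ins so the example compiles on its own; the
representation of @code{scm_char_t}, the assumption that @sc{ASCII}
characters keep their @sc{ASCII} numbers in Guile's character set, the
requirement that the caller offer room for at least one output
character, and the choice of @samp{?} for inexpressible characters are
all assumptions of this sketch rather than documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the Guile declarations (assumed shapes).  */
typedef unsigned long scm_char_t;
enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

/* Read function: each byte below 0x80 becomes one character.  */
enum scm_mb_read_result
ascii_read (void *cookie, const char **inbuf, size_t *inbytesleft,
            scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;                /* stateless: the cookie is unused */
  assert (*outcharsleft > 0);   /* assumed precondition of this sketch */

  if (*inbytesleft == 0)
    return scm_mb_read_incomplete;

  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char byte = (unsigned char) **inbuf;
      if (byte >= 0x80)
        return scm_mb_read_error;   /* stop before the invalid byte */
      *(*outbuf)++ = byte;
      ++*inbuf;
      --*inbytesleft;
      --*outcharsleft;
    }
  return scm_mb_read_ok;
}

/* Write function: characters outside ASCII are replaced with '?'.  */
enum scm_mb_write_result
ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
             char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;  /* pointers mark what is left */
      scm_char_t c = **inbuf;
      *(*outbuf)++ = (c < 0x80) ? (char) c : '?';
      ++*inbuf;
      --*incharsleft;
      --*outbytesleft;
    }
  return scm_mb_write_ok;
}
```

Both functions update the buffer pointers and counts as they go, so on
any return the caller can tell exactly how much input was consumed and
how much output was produced.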
 856
 857
 858 Conversions to or from stateful encodings need to keep track of each
 859 encoding's current state.  Each conversion context contains two
 860 @code{void *} variables called @dfn{cookies}, one for the input
 861 encoding, and one for the output encoding.  These cookies are passed to
 862 the encodings' functions, for them to use however they please.  A
 863 stateful encoding can use its cookie to hold a pointer to some object
 864 which maintains the context's current shift state.  Stateless encodings
 865 will probably not use their cookies.
 866
 867 The cookies' lifetime is the same as that of the context object.  When
 868 the user calls @code{scm_mb_iconv_close} to destroy a context object,
 869 @code{scm_mb_iconv_close} calls the input and output encodings'
 870 @code{destroy} functions, passing them their respective cookies, so each
 871 encoding can free any data it allocated for that context.
 872
 873 Note that, if a read or write function returns a successful result code
like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input and the output must together represent the complete
input text; the encoding may not store any text temporarily in its
 877 cookie.  This is because, if @code{scm_mb_iconv} returns a successful
 878 result to the user, it is correct for the user to assume that all the
 879 consumed input has been converted and placed in the output buffer.
 880 There is no ``flush'' operation to push any final results out of the
 881 encodings' buffers.
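
As an illustration of how a stateful encoding might use its cookie,
here is a sketch of @code{init}, @code{destroy}, and @code{reset}
functions for a made-up encoding with a single shifted state.  The
@code{struct shift_state} type, the use of the byte @code{0x0F} as the
sequence that returns to the initial state, the convention that zero
means success, and the numeric value given to
@code{scm_mb_iconv_too_big} are assumptions for this example only; the
text above says only that the status codes are negative.

```c
#include <stdlib.h>

/* Stand-in value; only the sign of the real constant is documented.  */
enum { scm_mb_iconv_too_big = -1 };

/* Private per-context data for the made-up encoding.  */
struct shift_state
{
  int shifted;                  /* nonzero while in the shifted state */
};

/* init receives the ADDRESS of the cookie and stores through it.  */
int
shift_init (void **cookie)
{
  struct shift_state *s = malloc (sizeof *s);
  if (s == NULL)
    return -1;
  s->shifted = 0;
  *cookie = s;
  return 0;
}

/* destroy also receives the address of the cookie.  */
int
shift_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie's value.  With a valid output buffer it
   emits the byte that returns the stream to its initial state.  */
int
shift_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct shift_state *s = cookie;

  if (outbuf != NULL && *outbuf != NULL && s->shifted)
    {
      if (*outbytesleft < 1)
        return scm_mb_iconv_too_big;  /* shift state left unchanged */
      *(*outbuf)++ = 0x0F;
      --*outbytesleft;
    }
  s->shifted = 0;
  return 0;
}
```

Because @code{reset} may be called for either direction, the sketch
clears the state unconditionally once any needed output has been
written, and touches the output buffer only when one was supplied.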
 882
 883 Here is the function you call to register a new encoding with the
 884 conversion system:
 885
 886 @deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
 887 Add the encoding described by @code{*@var{encoding}} to the set
 888 understood by @code{scm_mb_iconv_open}.  Once you have registered your
 889 encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
 890 the names in @code{@var{encoding}->names}.
 891 @end deftypefn
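
The following sketch shows one way registration and the
case-insensitive name lookup described for the @code{names} member
could fit together.  It is a model, not Guile's code:
@code{struct demo_encoding} is a reduced stand-in for
@code{struct scm_mb_encoding}, and @code{demo_register_encoding} and
@code{demo_find_encoding} are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding: only the members the
   registry itself needs.  */
struct demo_encoding
{
  char **names;                 /* array of names, ended by a zero pointer */
  struct demo_encoding *next;   /* filled in by registration */
};

static struct demo_encoding *registry = NULL;

/* Push ENC onto the front of the list of known encodings.  */
void
demo_register_encoding (struct demo_encoding *enc)
{
  enc->next = registry;
  registry = enc;
}

/* Compare two encoding names, ignoring case.  */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      ++a;
      ++b;
    }
  return *a == *b;              /* equal only if both ended together */
}

/* Return the registered encoding with a name matching NAME, or NULL.  */
struct demo_encoding *
demo_find_encoding (const char *name)
{
  for (struct demo_encoding *e = registry; e != NULL; e = e->next)
    for (char **n = e->names; *n != NULL; ++n)
      if (names_equal (*n, name))
        return e;
  return NULL;
}
```

Since registration links the caller's structure into the list through
its @code{next} member, the structure must remain allocated for as long
as the encoding may be looked up.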
 892
 893
 894 @node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
 895 @section Multibyte Text Processing Errors
 896
 897 This section describes error conditions which code can signal to
 898 indicate problems encountered while processing multibyte text.  In each
 899 case, the arguments @var{message} and @var{args} are an error format
 900 string and arguments to be substituted into the string, as accepted by
 901 the @code{display-error} function.
 902
 903 @deffn Condition text:not-char-boundary func message args object offset
 904 By calling @var{func}, the program attempted to access a character at
 905 byte offset @var{offset} in the Guile object @var{object}, but
 906 @var{offset} is not the start of a character's encoding in @var{object}.
 907
 908 Typically, @var{object} is a string or symbol.  If the function signalling
 909 the error cannot find the Guile object that contains the text it is
 910 inspecting, it should use @code{#f} for @var{object}.
 911 @end deffn
 912
 913 @deffn Condition text:bad-encoding func message args object
 914 By calling @var{func}, the program attempted to interpret the text in
 915 @var{object}, but @var{object} contains a byte sequence which is not a
 916 valid encoding for any character.
 917 @end deffn
 918
 919 @deffn Condition text:not-guile-char func message args number
 920 By calling @var{func}, the program attempted to treat @var{number} as the
 921 number of a character in the Guile character set, but @var{number} does
 922 not correspond to any character in the Guile character set.
 923 @end deffn
 924
 925 @deffn Condition text:unknown-conversion func message args from to
 926 By calling @var{func}, the program attempted to convert from an encoding
 927 named @var{from} to an encoding named @var{to}, but Guile does not
 928 support such a conversion.
 929 @end deffn
 930
 931 @deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
 932 @deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
 933 @deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
 935 condition symbols above.  You can use these when signalling these
 936 errors, instead of looking them up yourself.
 937 @end deftypevr
 938
 939
 940 @node Why Guile Does Not Use a Fixed-Width Encoding,  , Multibyte Text Processing Errors, Working With Multibyte Strings in C
 941 @section Why Guile Does Not Use a Fixed-Width Encoding
 942
 943 Multibyte encodings are clumsier to work with than encodings which use a
 944 fixed number of bytes for every character.  For example, using a
 945 fixed-width encoding, we can extract the @var{i}th character of a string
 946 in constant time, and we can always substitute the @var{i}th character
 947 of a string with any other character without reallocating or copying the
 948 string.
 949
 950 However, there are no fixed-width encodings which include the characters
 951 we wish to include, and also fit in a reasonable amount of space.
 952 Despite the Unicode standard's claims to the contrary, Unicode is not
 953 really a fixed-width encoding.  Unicode uses surrogate pairs to
 954 represent characters outside the 16-bit range; a surrogate pair must be
 955 treated as a single character, but occupies two 16-bit spaces.  As of
 956 this writing, there are already plans to assign characters to the
 957 surrogate character codes.  Three- and four-byte encodings are
 958 too wasteful for a majority of Guile's users, who only need @sc{ASCII}
 959 and a few accented characters.
 960
 961 Another alternative would be to have several different fixed-width
 962 string representations, each with a different element size.  For each
 963 string, Guile would use the smallest element size capable of
accommodating the string's text.  This would allow users of English and
 965 the Western European languages to use the traditional memory-efficient
 966 encodings.  However, if Guile has @var{n} string representations, then
 967 users must write @var{n} versions of any code which manipulates text
 968 directly --- one for each element size.  And if a user wants to operate
 969 on two strings simultaneously, and wants to avoid testing the string
 970 sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
 971 Most users will simply not bother.  Instead, they will write code which
 972 supports only one string size, leaving us back where we started.  By
 973 using a single internal representation, Guile makes it easier for users
 974 to write multilingual code.
 975
 976 [[What about tagging each string with its encoding?
 977 "Every extension must be written to deal with every encoding"]]
 978
 979 [[You don't really want to index strings anyway.]]
 980
 981 Finally, Guile's multibyte encoding is not so bad.  Unlike a two- or
 982 four-byte encoding, it is efficient in space for American and European
 983 users.  Furthermore, the properties described above mean that many
 984 functions can be coded just as they would for a single-byte encoding;
 985 see @ref{Promised Properties of the Guile Multibyte Encoding}.
 986
 987 @bye