What are the "serious" libraries?

Joe Nelson joe at begriffs.com
Mon Apr 6 03:36:08 UTC 2020


> > https://begriffs.com/posts/2019-01-19-inside-c-standard-lib.html#stdlib.h-wchar_t

June Bug wrote:
> Noticed something in that post: "Even Mac OS and iOS use 16 bit
> wchar_t for whatever reason."  That doesn't seem to be true:
> 
> 	printf("%zu\n", sizeof(wchar_t));
> 
> prints 4 on macOS. However, __STDC_ISO_10646__ doesn’t seem to be
> defined.

Hm, thanks for fact-checking; I confirmed it's four bytes on my Mac as
well. The header /usr/include/i386/_types.h has a comment about the
choice of wchar_t representation:

/*
 * The rune type below is declared to be an ``int'' instead of the more
 * natural ``unsigned long'' or ``long''.  Two things are happening
 * here.  It is not unsigned so that EOF (-1) can be naturally assigned
 * to it and used.  Also, it looks like 10646 will be a 31 bit standard.
 * This means that if your ints cannot hold 32 bits, you will be in
 * trouble.  The reason an int was chosen over a long is that the is*()
 * and to*() routines take ints (says ANSI C), but they use
 * __darwin_ct_rune_t instead of int.  By changing it here, you lose a
 * bit of ANSI conformance, but your programs will still work.
 *
 * NOTE: rune_t is not covered by ANSI nor other standards, and should
 * not be instantiated outside of lib/libc/locale.  Use wchar_t.
 * wchar_t and rune_t must be the same type.  Also wint_t must be no
 * narrower than wchar_t, and should also be able to hold all members of
 * the largest character set plus one extra value (WEOF). wint_t must be
 * at least 16 bits.
 */

So this brings us to the next question: why isn't __STDC_ISO_10646__
defined when wchar_t is wide enough to hold UTF-32 code units? The C99
spec mentions it only briefly:

__STDC_ISO_10646__

	An integer constant of the form yyyymmL (for example, 199712L). If
	this symbol is defined, then every character in the Unicode required
	set, when stored in an object of type wchar_t, has the same value as
	the short identifier of that character. The Unicode required set
	consists of all the characters that are defined by ISO/IEC 10646,
	along with all amendments and technical corrigenda, as of the
	specified year and month.

I'm not familiar with the "short identifier" lingo, and I don't believe
the Unicode standard uses that term. So I downloaded the ISO 10646 spec,
which says:

6.5 Short identifiers for code points (UIDs)

	This International Standard defines short identifiers for each code
	point, including code points that are reserved (unassigned). A short
	identifier for any code point is distinct from a short identifier
	for any other code point. If a character is allocated at a code
	point, a short identifier for that code point can be used to refer
	to the character allocated at that code point.
	[…]
	EXAMPLE
	The short identifier for LATIN SMALL LETTER LONG S may be notated in
	any of the following forms: 017F +017F U017F U+017F

It's basically just the codepoint, written with a certain syntax.

So all I can conclude is that if __STDC_ISO_10646__ is not defined, the
implementation is declining to guarantee that every Unicode character,
stored in a wchar_t, has its codepoint as its value. Either some
character actually differs, or the vendor simply won't promise it. Why
would that be? Does anyone have insight into this?


More information about the Friends mailing list