Joe Nelson joe at begriffs.com
Mon Apr 6 03:36:08 UTC 2020

> > https://begriffs.com/posts/2019-01-19-inside-c-standard-lib.html#stdlib.h-wchar_t

June Bug wrote:
> Noticed something in that post: "Even Mac OS and iOS use 16 bit
> wchar_t for whatever reason."  That doesn’t seem to be true:
> 	printf("%zu\n", sizeof(wchar_t));
> prints 4 on macOS. However, __STDC_ISO_10646__ doesn’t seem to be
> defined.

Hm, thanks for fact-checking. I confirmed it's four bytes on my Mac as
well. The header /usr/include/i386/_types.h has a comment about the
choice of wchar_t representation:

 * The rune type below is declared to be an ``int'' instead of the more
 * natural ``unsigned long'' or ``long''.  Two things are happening
 * here.  It is not unsigned so that EOF (-1) can be naturally assigned
 * to it and used.  Also, it looks like 10646 will be a 31 bit standard.
 * This means that if your ints cannot hold 32 bits, you will be in
 * trouble.  The reason an int was chosen over a long is that the is*()
 * and to*() routines take ints (says ANSI C), but they use
 * __darwin_ct_rune_t instead of int.  By changing it here, you lose a
 * bit of ANSI conformance, but your programs will still work.
 * NOTE: rune_t is not covered by ANSI nor other standards, and should
 * not be instantiated outside of lib/libc/locale.  Use wchar_t.
 * wchar_t and rune_t must be the same type.  Also wint_t must be no
 * narrower than wchar_t, and should also be able to hold all members of
 * the largest character set plus one extra value (WEOF). wint_t must be
 * at least 16 bits.

So this brings us to the next question: why isn't __STDC_ISO_10646__
defined when wchar_t is wide enough to hold UTF-32 code units? The C99
spec describes the macro briefly:

	An integer constant of the form yyyymmL (for example, 199712L). If
	this symbol is defined, then every character in the Unicode required
	set, when stored in an object of type wchar_t, has the same value as
	the short identifier of that character. The Unicode required set
	consists of all the characters that are defined by ISO/IEC 10646,
	along with all amendments and technical corrigenda, as of the
	specified year and month.

I'm not familiar with the "short identifier" lingo, and I don't believe
the Unicode standard uses that term. So I downloaded the ISO 10646 spec,
and it says:

6.5 Short identifiers for code points (UIDs)

	This International Standard defines short identifiers for each code
	point, including code points that are reserved (unassigned). A short
	identifier for any code point is distinct from a short identifier
	for any other code point. If a character is allocated at a code
	point, a short identifier for that code point can be used to refer
	to the character allocated at that code point.
	The short identifier for LATIN SMALL LETTER LONG S may be notated in
	any of the following forms: 017F +017F U017F U+017F

It's basically just the code point written in hexadecimal, optionally
prefixed with "+", "U", or "U+".

So all I can conclude is that if __STDC_ISO_10646__ is not defined, then
there is presumably at least one Unicode character that is stored in
wchar_t using a numerical value different from its code point. Why would
that be? Does anyone have insight into this?

