What are the "serious" libraries?
joe at begriffs.com
Mon Apr 6 03:36:08 UTC 2020
> > https://begriffs.com/posts/2019-01-19-inside-c-standard-lib.html#stdlib.h-wchar_t
June Bug wrote:
> Noticed something in that post: "Even Mac OS and iOS use 16 bit
> wchar_t for whatever reason." That doesn’t seem to be true:
> printf("%zu\n", sizeof(wchar_t));
> prints 4 on macOS. However, __STDC_ISO_10646__ doesn’t seem to be
Hm, thanks for fact checking. I confirmed it's four bytes on my Mac as
well. The header /usr/include/i386/_types.h has a comment about the
choice of wchar_t representation:
* The rune type below is declared to be an ``int'' instead of the more
* natural ``unsigned long'' or ``long''. Two things are happening
* here. It is not unsigned so that EOF (-1) can be naturally assigned
* to it and used. Also, it looks like 10646 will be a 31 bit standard.
* This means that if your ints cannot hold 32 bits, you will be in
* trouble. The reason an int was chosen over a long is that the is*()
* and to*() routines take ints (says ANSI C), but they use
* __darwin_ct_rune_t instead of int. By changing it here, you lose a
* bit of ANSI conformance, but your programs will still work.
* NOTE: rune_t is not covered by ANSI nor other standards, and should
* not be instantiated outside of lib/libc/locale. Use wchar_t.
* wchar_t and rune_t must be the same type. Also wint_t must be no
* narrower than wchar_t, and should also be able to hold all members of
* the largest character set plus one extra value (WEOF). wint_t must be
* at least 16 bits.
So this brings us to the next question: why isn't __STDC_ISO_10646__
defined when wchar_t is wide enough to hold UTF-32 code units? The C99
spec describes the macro briefly:
An integer constant of the form yyyymmL (for example, 199712L). If
this symbol is defined, then every character in the Unicode required
set, when stored in an object of type wchar_t, has the same value as
the short identifier of that character. The Unicode required set
consists of all the characters that are defined by ISO/IEC 10646,
along with all amendments and technical corrigenda, as of the
specified year and month.
I'm not familiar with the "short identifier" lingo, and I don't believe
the Unicode standard uses that term. So I downloaded the ISO 10646 spec,
and they say:
6.5 Short identifiers for code points (UIDs)
This International Standard defines short identifiers for each code
point, including code points that are reserved (unassigned). A short
identifier for any code point is distinct from a short identifier
for any other code point. If a character is allocated at a code
point, a short identifier for that code point can be used to refer
to the character allocated at that code point.
The short identifier for LATIN SMALL LETTER LONG S may be notated in
any of the following forms: 017F +017F U017F U+017F
It's basically just the codepoint, written with a certain syntax.
So all I can conclude is that if __STDC_ISO_10646__ is not defined, then
there is at least one Unicode character that is stored in wchar_t using
a numerical value different from its codepoint. Why would that be? Does
anyone have insight into this?