Unicode study

Joe Nelson joe at begriffs.com
Thu Mar 7 07:27:27 UTC 2019


Worked with Darren tonight at the Hack Factory. I was learning more
about Unicode, and wondered specifically how Javascript handles it. We
worked on finding out. JS predates the extension of Unicode beyond 16
bits, so its strings are sequences of UTF-16 code units. In fact most
string functions don't handle surrogate pairs, so strings with
codepoints outside the Basic Multilingual Plane give weird results:

  var a = "0123456789";  // a.length is 10
  var b = "😀😀😀😀😀😀😀😀😀😀";  // ten non-BMP codepoints, but b.length is 20
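
Most of the built-in string functions count and index by UTF-16 code
unit, so they can hand back half of a surrogate pair. A quick sketch,
using U+1F600 as an arbitrary codepoint outside the BMP:

  var c = "😀";            // one codepoint, U+1F600
  c.length;                 // 2, stored as a surrogate pair
  c.charAt(0);              // "\ud83d", an unpaired high surrogate
  c.codePointAt(0);         // 128512, i.e. U+1F600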

To deal with whole codepoints rather than UTF-16 code units, you need
to explicitly break the string into an Array. Here's a function to turn
a string into a list of the numeric values of its codepoints:

  const codePoints = s =>
    [...s].map(x => x.codePointAt(0))
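
For example, a string mixing BMP and non-BMP characters comes back as
one number per codepoint, not one per UTF-16 code unit:

  codePoints("abc😀")   // [ 97, 98, 99, 128512 ]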

Such a list can be filtered or manipulated, but chances are you should
normalize the string into Unicode normalization form NFD first, so that
each precomposed character splits into its base and combining
codepoints. String.prototype.normalize() can do this.
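
For instance, "é" can be a single precomposed codepoint or an "e"
followed by a combining accent; NFD guarantees the decomposed form. A
quick check with the codePoints helper above:

  codePoints("\u00e9")                   // [ 233 ]
  codePoints("\u00e9".normalize('NFD'))  // [ 101, 769 ]  "e" + U+0301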

After manipulating the list, you can reassemble it into a UTF-16 string
like this:

  const toString = arr =>
    String.fromCodePoint(...arr)
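
For example (the argument lists here are just made-up test data):

  toString([ 72, 105, 128512 ])        // "Hi😀"
  toString(codePoints("any string"))   // round-trips unchanged

One caveat: spreading a very large array into fromCodePoint can exceed
the engine's argument limit, so long inputs may need to be rebuilt in
chunks.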

These pieces suffice to write a Node program which can strip Emoji from
its input. Who ever thought we'd see an addition to the JS ecosystem
that *removes* Emoji? ;)

This program is a little overzealous and removes characters it
shouldn't: it strips every codepoint outside the BMP, not just Emoji.
The numeric range has to be tightened up to match the actual Emoji
blocks, like "Supplemental Symbols and Pictographs"; a sketch of a
tighter filter follows the program.

  // Keep only codepoints in the Basic Multilingual Plane (below U+10000).
  const inBmp = c =>
    (c < 0x10000)

  process.stdin.setEncoding('utf8');
  process.stdin.on('readable', () => {
    var str;

    while ((str = process.stdin.read()) !== null) {
      process.stdout.write(
        toString(codePoints(str).filter(inBmp))
      );
    }
  });
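
As for tightening the range, a filter like the following might be
closer to the mark. The block boundaries are my reading of the Unicode
charts (Supplemental Symbols and Pictographs is U+1F900-U+1F9FF), and
real Emoji span several other blocks too, so treat this as a sketch:

  // Drop only the Supplemental Symbols and Pictographs block.
  // Other Emoji blocks (Emoticons U+1F600-U+1F64F, Miscellaneous
  // Symbols and Pictographs U+1F300-U+1F5FF, ...) would need their
  // own ranges.
  const keepCodepoint = c =>
    !(c >= 0x1F900 && c <= 0x1F9FF)

  // then: codePoints(str).filter(keepCodepoint)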

Anyhow, this UTF-16 business is something to think about next time
you're doing string handling in a web app.

