JavaScript string to Unicode (Hex)

Question

This set of questions is related to a project I've published for converting characters, or strings, to hex-based Unicode; e.g.

toUnicode.fromCharacter('🍍');
//> 1f34d
toUnicode.fromString('Spam!', '0x');
//> ['0x53', '0x70', '0x61', '0x6d', '0x21']

Questions

Are there any mistakes, such as unaccounted for edge-cases?
Have I missed any test cases?
Any suggestions on making the code more readable and/or easier to extend?
Are there any features that are wanted?

Setup and Usage

The source code is maintained on GitHub and may be cloned via the following commands. A live demo is hosted online, thanks to GitHub Pages.

mkdir -vp ~/git/hub/javascript-utilities
cd ~/git/hub/javascript-utilities
git clone git@github.com:javascript-utilities/to-unicode.git

The build target is ECMAScript version 6, and so far both manual tests and automated JestJS tests show that the toUnicode methods function as intended in both browser and NodeJS environments.

Example NodeJS Usage

const toUnicode = require('./to-unicode.js');
var panda_code = toUnicode.fromCharacter('🐼');
console.log(panda_code);
//> '1f43c'

Source Code

I am mainly concerned with improving the JavaScript and TypeScript; i.e., the HTML is intended to be simple and functional.

'use strict';

/**
 * Namespace for static methods that convert characters and strings to Unicode
 */
class toUnicode {
  /**
   * Converts character to Hex Unicode
   * @param {string} character
   * @return {string}
   * @author S0AndS0
   * @copyright AGPL-3.0
   * @example
   * toUnicode.fromCharacter('🐼');
   * //> "1f43c"
   */
  static fromCharacter(character) {
    return character.codePointAt(undefined).toString(16);
  }

  /**
   * Converts string to character array of Unicode(s)
   * @param {string} characters
   * @return {string[]}
   * @author S0AndS0
   * @copyright AGPL-3.0
   * @example
   * toUnicode.fromString('🎉 👋');
   * //> [ '1f389', '20', '1f44b' ]
   */
  static fromString(characters, prefix = '') {
    return [...characters].reduce((accumulator, character) => {
      const unicode = toUnicode.fromCharacter(character);
      accumulator.push(`${prefix}${unicode}`);
      return accumulator;
    }, []);
  }
}

/* istanbul ignore next */
if (typeof module !== 'undefined') {
  module.exports = toUnicode;
}

<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>toUnicode Usage Example</title>
    <script type="text/javascript" src="assets/js/modules/to-unicode.js" differ></script>
    <script type="text/javascript" differ>
      const text_input__callback = (_event) => {
        const client_input = document.getElementById('client__text--input').value;
        const client_prefix = document.getElementById('client__text--prefix').value;
        const output_element = document.getElementById('client__text--output');
        const unicode_list = toUnicode.fromString(client_input, client_prefix);
        console.log(unicode_list);
        output_element.innerText = unicode_list.join('\n');
      };

      window.addEventListener('load', () => {
        const client_text_input = document.getElementById('client__text--input');
        const client_text_prefix = document.getElementById('client__text--prefix');
        client_text_input.addEventListener('input', text_input__callback);
        client_text_prefix.addEventListener('input', text_input__callback);
      });
    </script>
  </head>
  <body>
    <span>Prefix: </span>
    <input type="text" id="client__text--prefix" value="0x">
    <br>
    <span>Input: </span>
    <input type="text" id="client__text--input" value="">
    <pre id="client__text--output"></pre>
  </body>
</html>

JestJS Tests

For completeness here are the JestJS tests.

'use strict';

/**
 * Tests modules within `to-unicode.js` script
 * @author S0AndS0
 * @copyright AGPL-3.0
 */
class toUnicode_Test {
  constructor(min_code_point = 161, max_code_point = 1114111) {
    this.toUnicode = require('../to-unicode.js');
    this.min_code_point = min_code_point;
    this.max_code_point = max_code_point;
  }

  randomCodePoint() {
    return Math.random() * (this.max_code_point - this.min_code_point + 1) + this.min_code_point | 0;
  }

  runTests() {
    this.testInvariance();
  }

  /**
   * Tests if `fromCharacter()` and `fromString()` functions are reversible.
   */
  testInvariance() {
    const character_code_list = Array(99).fill(0).map((_) => {
      return this.randomCodePoint();
    });
    let unicode_list = [];
    let characters_string = '';

    test('Is `fromCharacter()` reversible?', () => {
      character_code_list.forEach((code_point) => {
        const character = String.fromCodePoint(code_point);
        const unicode = this.toUnicode.fromCharacter(character);
        const decimal = Number(`0x${unicode}`);
        expect(decimal).toEqual(code_point);
        unicode_list.push(unicode);
        characters_string += character;
      });
    });

    test('Is `fromString()` reversible?', () => {
      expect(this.toUnicode.fromString(characters_string)).toStrictEqual(unicode_list);
    });
  }
}

const test_toUnicode = new toUnicode_Test();
test_toUnicode.runTests();

adiga · Accepted Answer · 2020-08-07 08:52:02Z

This is regarding the edge-cases and test cases mentioned in the question:

[...characters] // or Array.from(characters)

splits a string into an array of characters correctly in most cases. It is better than characters.split("") because it iterates by code point, so a surrogate pair stays together as one element instead of being split into two code units.

console.log( "🍍".length ) // 2
console.log( [..."🍍"] )
console.log( "🍍".split("") )
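To make the surrogate-pair point concrete, here is a small sketch of how the two UTF-16 code units of 🍍 combine into a single code point, and how the three splitting approaches differ. The helper name codePointFromSurrogates is made up for illustration and is not part of the project.

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function codePointFromSurrogates(high, low) {
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

const pineapple = '\u{1F34D}'; // the pineapple emoji, written as an escape

const high = pineapple.charCodeAt(0); // 0xd83c, the high (leading) surrogate
const low = pineapple.charCodeAt(1);  // 0xdf4d, the low (trailing) surrogate

console.log(codePointFromSurrogates(high, low).toString(16)); // '1f34d'

// Iteration-based splitting keeps the pair together...
console.log([...pineapple].length);        // 1
console.log(Array.from(pineapple).length); // 1
// ...while split('') works on code units and tears it apart.
console.log(pineapple.split('').length);   // 2
```

This is also why codePointAt(0) on a single spread element returns the full code point rather than only the high surrogate.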

What if you input something like 👨‍👩‍👧‍👦? You get

0x1f468 0x200d 0x1f469 0x200d 0x1f467 0x200d 0x1f466

and not a single hex value as before. That is because there are hundreds of emoji sequences that combine multiple emoji but display as a single emoji. They are joined with a Zero Width Joiner (U+200D) character. When you use [...] on them, they are split into an array of the individual emoji and joiner characters.

console.log("👁️‍🗨️".length) // 7
console.log(Array.from("👁️‍🗨️"))
console.log("👨‍👩‍👧‍👦".length) // 11
console.log(Array.from("👨‍👩‍👧‍👦"))

Similarly, many languages create a grapheme or a symbol with combining marks. These look like distinct units of writing, but are made up of multiple Unicode code points.

The strings below are not the same. The first contains the precomposed á (U+00E1), while the second contains a followed by the combining mark U+0301.

const a = "álgebra", // precomposed á (U+00E1)
      b = "álgebra" // "a" followed by the combining mark U+0301
console.log(a === b) // false
console.log(a.length, b.length)
console.log([...a].join(" , "))
console.log([...b].join(" , "))
console.log([..."हिन्दी"].join(" , ")) // Devanagari script

ि is a vowel sign and isn't used on its own. It needs to be combined with a consonant like ह (Ha) to get हि (He).
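If the two spellings of álgebra should produce the same hex output, one option is to normalize the input first. The sketch below uses the standard String.prototype.normalize method; the toHex helper is a made-up stand-in for fromString, and whether normalizing is desirable depends on what the output is used for.

```javascript
const precomposed = '\u00e1lgebra'; // á as a single code point
const decomposed = 'a\u0301lgebra'; // "a" plus combining acute accent

console.log(precomposed === decomposed);            // false
console.log(precomposed.length, decomposed.length); // 7 8

// NFC composes where a precomposed form exists; NFD decomposes.
console.log(precomposed === decomposed.normalize('NFC')); // true
console.log(precomposed.normalize('NFD') === decomposed); // true

// Hypothetical stand-in for toUnicode.fromString, for illustration only.
const toHex = (text) =>
  Array.from(text, (character) => character.codePointAt(0).toString(16));

console.log(toHex(decomposed)[0]);                  // '61'
console.log(toHex(decomposed.normalize('NFC'))[0]); // 'e1'
```

Normalization only helps where a precomposed code point exists. Sequences such as हि or the ZWJ emoji above have no single-code-point form, so they still produce several hex values.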

You can build long strings by stacking multiple combining marks, even though the result still looks like it has 6 distinct characters:

const a = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
console.log(a.length) // 75
console.log(Array.from(a))

The scenarios mentioned are not issues per se. You are basically converting each code point of the string to its Unicode hex value. But a grapheme or symbol, as the reader sees it, doesn't necessarily correspond to a single hex value in the output. You can keep these in mind or add them to your edge cases / test cases.
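If grouping the output by what the reader perceives as one character were ever wanted, a rough sketch could lean on Intl.Segmenter, assuming the runtime provides it (recent browsers and Node.js versions do). The function name hexByGrapheme is invented for this example and is not part of the project.

```javascript
// Hypothetical helper: one inner array of hex code points per grapheme cluster.
function hexByGrapheme(text, prefix = '') {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), ({ segment }) =>
    Array.from(segment, (character) =>
      `${prefix}${character.codePointAt(0).toString(16)}`
    )
  );
}

// Family emoji: four emoji joined by three Zero Width Joiners.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const familyGroups = hexByGrapheme(family);
console.log(familyGroups.length);    // 1 group
console.log(familyGroups[0].length); // 7 code points inside it

// "a" + combining acute accent, then "b".
console.log(hexByGrapheme('a\u0301b', '0x'));
// [ [ '0x61', '0x301' ], [ '0x62' ] ]
```

The flat output of fromString stays just as valid; this only shows how the same hex values could be grouped for test cases that care about visible characters.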

Some further reading:

JavaScript has a Unicode problem
What every JavaScript developer should know about Unicode
Full Emoji List, v13.0 (You can see the Code column to know if emojis are created from multiple emojis)
Emoji Zero Width Joiner ZWJ Sequence

Also, codePointAt takes a number as a parameter.

return character.codePointAt(undefined).toString(16)

is the same as

return character.codePointAt().toString(16)

Both of these work because an undefined (or omitted) position is treated as 0. It's better to pass 0 explicitly, as that is easier to understand. It wasn't clear at first why you were passing undefined.

return character.codePointAt(0).toString(16)
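A quick check of the three forms, plus two inputs that might be worth adding to the test cases. This is only a sketch of what the built-in method does, with variable names chosen for the example.

```javascript
const pineapple = '\u{1F34D}'; // the pineapple emoji, written as an escape

const viaUndefined = pineapple.codePointAt(undefined);
const viaOmitted = pineapple.codePointAt();
const viaZero = pineapple.codePointAt(0);

console.log(viaUndefined === viaZero && viaOmitted === viaZero); // true
console.log(viaZero.toString(16)); // '1f34d'

// Position 1 lands on the low surrogate, so only that code unit comes back.
console.log(pineapple.codePointAt(1).toString(16)); // 'df4d'

// An empty string has no element at position 0, so the result is undefined,
// and calling .toString(16) on it would throw a TypeError.
console.log(''.codePointAt(0)); // undefined
```

The last case suggests that fromCharacter('') would throw as written. Whether that is acceptable or deserves a guard is a design decision for the project.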

Thanks for the detailed answer! Passing undefined to codePointAt was done to prevent TypeScript from complaining, but I'll certainly consider switching to 0, as that seems to be less confusing for readers (myself included). As for Array.from(_string_) vs. [..._string_], so far tests seem to produce equivalent output, and last I read the spread syntax was the shorthand way of avoiding issues with String.split("") doing unkind things to non-ASCII strings. - S0AndS0, Commented Aug 7, 2020 at 0:36
@S0AndS0 sorry, I wasn't clear in the answer. Array.from() and [...] are equivalent for a string input. They both use the Symbol.iterator property of the string to create an array. - adiga, Commented Aug 7, 2020 at 5:10
