\$\begingroup\$

This set of questions relates to a project I've published for converting characters, or strings, to hex-based Unicode code points; e.g.

toUnicode.fromCharacter('🍍');
//> 1f34d

toUnicode.fromString('Spam!', '0x');
//> ['0x53', '0x70', '0x61', '0x6d', '0x21']

Questions

  • Are there any mistakes, such as unaccounted for edge-cases?

  • Have I missed any test cases?

  • Any suggestions on making the code more readable and/or easier to extend?

  • Are there any features that are wanted?


Setup and Usage

The Source code is maintained on GitHub and may be cloned via the following commands. A Live demo is hosted online, thanks to GitHub Pages.

mkdir -vp ~/git/hub/javascript-utilities
cd ~/git/hub/javascript-utilities
git clone git@github.com:javascript-utilities/to-unicode.git

The build target is ECMAScript version 6, and so far both manual tests and automated JestJS tests show that the toUnicode methods function as intended in both browser and NodeJS environments.

Example NodeJS Usage

const toUnicode = require('./to-unicode.js');

var panda_code = toUnicode.fromCharacter('🐼');
console.log(panda_code);
//> '1f43c'

Source Code

I am mainly concerned with improving the JavaScript and TypeScript; i.e. the HTML is intended to be simple and functional.

'use strict';


/**
 * Namespace for static methods that convert characters and strings to Unicode
 */
class toUnicode {
  /**
   * Converts character to Hex Unicode
   * @param {string} character
   * @return {string}
   * @author S0AndS0
   * @copyright AGPL-3.0
   * @example
   * toUnicode.fromCharacter('🐼');
   * //> "1f43c"
   */
  static fromCharacter(character) {
    return character.codePointAt(undefined).toString(16);
  }

  /**
   * Converts string to character array of Unicode(s)
   * @param {string} characters
   * @return {string[]}
   * @author S0AndS0
   * @copyright AGPL-3.0
   * @example
   * toUnicode.fromString('🎉 👋');
   * //> [ '1f389', '20', '1f44b' ]
   */
  static fromString(characters, prefix = '') {
    return [...characters].reduce((accumulator, character) => {
      const unicode = toUnicode.fromCharacter(character);
      accumulator.push(`${prefix}${unicode}`);
      return accumulator;
    }, []);
  }
}


/* istanbul ignore next */
if (typeof module !== 'undefined') {
  module.exports = toUnicode;
}

<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>toUnicode Usage Example</title>
    <script type="text/javascript" src="assets/js/modules/to-unicode.js" differ></script>
    <script type="text/javascript" differ>
      const text_input__callback = (_event) => {
        const client_input = document.getElementById('client__text--input').value;
        const client_prefix = document.getElementById('client__text--prefix').value;
        const output_element = document.getElementById('client__text--output');

        const unicode_list = toUnicode.fromString(client_input, client_prefix);
        console.log(unicode_list);
        output_element.innerText = unicode_list.join('\n');
      };

      window.addEventListener('load', () => {
        const client_text_input = document.getElementById('client__text--input');
        const client_text_prefix = document.getElementById('client__text--prefix');

        client_text_input.addEventListener('input', text_input__callback);
        client_text_prefix.addEventListener('input', text_input__callback);
      });
    </script>
  </head>
  <body>
    <span>Prefix: </span>
    <input type="text" id="client__text--prefix" value="0x">
    <br>
    <span>Input: </span>
    <input type="text" id="client__text--input" value="">
    <pre id="client__text--output"></pre>
  </body>
</html>


JestJS Tests

For completeness here are the JestJS tests.

'use strict';


/**
 * Tests modules within `to-unicode.js` script
 * @author S0AndS0
 * @copyright AGPL-3.0
 */
class toUnicode_Test {
  constructor(min_code_point = 161, max_code_point = 1114111) {
    this.toUnicode = require('../to-unicode.js');
    this.min_code_point = min_code_point;
    this.max_code_point = max_code_point;
  }

  randomCodePoint() {
    return Math.random() * (this.max_code_point - this.min_code_point + 1) + this.min_code_point | 0;
  }

  runTests() {
    this.testInvariance();
  }

  /**
   * Tests if `fromCharacter()` and `fromString()` functions are reversible.
   */
  testInvariance() {
    const character_code_list = Array(99).fill(0).map((_) => {
      return this.randomCodePoint();
    });

    let unicode_list = [];
    let characters_string = '';

    test('Is `fromCharacter()` reversible?', () => {
      character_code_list.forEach((code_point) => {
        const character = String.fromCodePoint(code_point);
        const unicode = this.toUnicode.fromCharacter(character);
        const decimal = Number(`0x${unicode}`);
        expect(decimal).toEqual(code_point);

        unicode_list.push(unicode);
        characters_string += character;
      });
    });

    test('Is `fromString()` reversible?', () => {
      expect(this.toUnicode.fromString(characters_string)).toStrictEqual(unicode_list);
    });
  }
}


const test_toUnicode = new toUnicode_Test();
test_toUnicode.runTests();
\$\endgroup\$

    1 Answer

    \$\begingroup\$

    This is regarding the edge-cases and test cases mentioned in the question:

    [...characters] // or Array.from(characters) 

    handles splitting the characters of a string into an array in most cases. It is better than characters.split("") because it iterates over code points rather than UTF-16 code units, so surrogate pairs stay intact.

    console.log( "🍍".length ) // 2
    console.log( [..."🍍"] )
    console.log( "🍍".split("") )
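To make the difference concrete, here is a small sketch that converts each piece to hex the same way fromCharacter does. The toHex helper is hypothetical and exists only for this illustration; it is not part of to-unicode.js.

```javascript
// Hypothetical helper for illustration: hex value of the first code point in a string.
const toHex = (piece) => piece.codePointAt(0).toString(16);

const pineapple = '🍍'; // U+1F34D, stored as two UTF-16 code units

// split('') cuts between UTF-16 code units, leaving two lone surrogates.
const bySplit = pineapple.split('').map(toHex);

// Spread uses the string iterator, which walks whole code points.
const bySpread = [...pineapple].map(toHex);

console.log(bySplit);  // [ 'd83c', 'df4d' ]
console.log(bySpread); // [ '1f34d' ]
```

With split("") the output would be the two surrogate halves rather than the code point a reader expects.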

    What if you input something like 👨‍👩‍👧‍👦? You get

    0x1f468
    0x200d
    0x1f469
    0x200d
    0x1f467
    0x200d
    0x1f466

    and not a single hex value as before. This is because there are hundreds of emoji sequences that combine multiple emojis but display as a single emoji. They are joined with a Zero Width Joiner (U+200D) character. When you use [...] on them, they are split into an array of individual emojis and joiner characters.

    console.log("👁️‍🗨️".length) // 7
    console.log(Array.from("👁️‍🗨️"))

    console.log("👨‍👩‍👧‍👦".length) // 11
    console.log(Array.from("👨‍👩‍👧‍👦"))

    Similarly, many languages build a grapheme, or symbol, from a base character plus combining marks. Each one looks like a single distinct unit of writing, but is made up of multiple Unicode code points.

    The two strings below are not the same. The first contains the precomposed á (U+00E1), while the second contains a followed by the combining mark U+0301.

    const a = "álgebra",
          b = "álgebra"

    console.log(a === b) // false
    console.log(a.length, b.length)

    console.log([...a].join(" , "))
    console.log([...b].join(" , "))

    console.log([..."हिन्दी"].join(" , ")) // Devanagari script
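If you want the two spellings to produce the same output, one option is String.prototype.normalize. The sketch below only shows what normalization does; whether the converter should normalize its input at all is a design decision for the project, not something this answer assumes. Escape sequences are used so the two forms stay distinguishable in any editor.

```javascript
const precomposed = '\u00e1lgebra'; // "á" as one code point (U+00E1)
const decomposed = 'a\u0301lgebra'; // "a" followed by the combining acute accent (U+0301)

console.log(precomposed === decomposed);             // false
console.log(precomposed.length, decomposed.length);  // 7 8

// NFC composes base + combining mark where a precomposed code point exists.
const composed = decomposed.normalize('NFC');
console.log(composed === precomposed);               // true

// NFD goes the other way and splits the precomposed form apart.
const expanded = precomposed.normalize('NFD');
console.log(expanded === decomposed);                // true
```

Note that normalization only helps where a precomposed form exists; many combining sequences have none and stay as multiple code points.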

    ि is a vowel sound and isn't used on its own. It needs to be combined with a consonant like ह (Ha) to get हि (He)

    You can create big strings using multiple combining marks while the string looks like it has 6 distinct characters:

    const a = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'

    console.log(a.length) // 75
    console.log(Array.from(a))

    The scenarios mentioned are not issues per se. You are simply converting the string to its corresponding Unicode hex values. But each grapheme, or symbol, doesn't necessarily correspond to a single hex value in the output. You can keep these in mind or add them to your edge cases / test cases.
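If grouping the output per visible symbol were ever wanted, a rough sketch could look like the following. The hexByGrapheme function is hypothetical and not part of to-unicode.js, and it assumes a runtime that provides Intl.Segmenter (recent browsers and NodeJS versions do). Escapes are used for the family emoji so the joiners are visible.

```javascript
// Hypothetical helper, not part of to-unicode.js:
// returns one array of hex code points per grapheme cluster.
// Assumes the runtime supports Intl.Segmenter.
const hexByGrapheme = (text, prefix = '') => {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), ({ segment }) => {
    // Within one cluster, spread still walks individual code points.
    return [...segment].map((character) => {
      return `${prefix}${character.codePointAt(0).toString(16)}`;
    });
  });
};

// Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy
const family = '\u{1f468}\u200d\u{1f469}\u200d\u{1f467}\u200d\u{1f466}';

console.log(hexByGrapheme(family));
// [ [ '1f468', '200d', '1f469', '200d', '1f467', '200d', '1f466' ] ]

console.log(hexByGrapheme('e\u0301!'));
// [ [ '65', '301' ], [ '21' ] ]
```

The hex values are the same as before; only the nesting changes, so each inner array lines up with one symbol on screen.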



    Also, codePointAt takes a number as a parameter.

    return character.codePointAt(undefined).toString(16) 

    is the same as

    return character.codePointAt().toString(16) 

    Both of these work because an undefined argument is treated as 0. It's better to pass 0 explicitly, as that is easier to understand. It wasn't clear why you were passing undefined initially.

    return character.codePointAt(0).toString(16) 
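A quick check of the three spellings, plus two related cases that may be worth a test: an index inside a surrogate pair, and an empty string. This is only a sketch of codePointAt behaviour, not a change to the project's code.

```javascript
const panda = '🐼'; // U+1F43C

const noArgument = panda.codePointAt().toString(16);
const undefinedArgument = panda.codePointAt(undefined).toString(16);
const zeroArgument = panda.codePointAt(0).toString(16);

console.log(noArgument, undefinedArgument, zeroArgument); // 1f43c 1f43c 1f43c

// Index 1 lands inside the surrogate pair and returns only the trailing half.
const trailingHalf = panda.codePointAt(1).toString(16);
console.log(trailingHalf); // dc3c

// There is nothing at index 0 of an empty string, so the result is undefined.
// Calling .toString(16) on that, as fromCharacter('') would, throws a TypeError.
const emptyResult = ''.codePointAt(0);
console.log(emptyResult); // undefined
```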
    \$\endgroup\$
    • Thanks for the detailed answer!... Passing undefined to codePointAt was done to prevent TypeScript from complaining, but I'll certainly consider switching to 0 as that seems to be less confusing for readers (myself included)... As for Array.from(_string_) vs. [..._string_], so far tests seem to produce equivalent output, and last I read the spread syntax was the short-hand way of avoiding issues with String.split("") doing unkind things to non-ASCII strings.
      – S0AndS0
      Commented Aug 7, 2020 at 0:36
    • @S0AndS0 sorry, I wasn't clear in the answer. Array.from() and [...] are equivalent for a string input. They both use the Symbol.iterator property of the string to create an array.
      – adiga
      Commented Aug 7, 2020 at 5:10
