
Context

This review request is a follow-up to this question.

After the initial implementation, which focused mainly on spec compliance, I made several revisions to improve the parser's performance. With all the changes applied, the time to parse a 10,000-line JSON file (included in the repo as a reference; run yarn test:perf) went from ~5 seconds down to roughly 100 ms.

In the git repository I pulled in a "JSON test suite", which I run the tests against. To help anyone kind enough to review my code, I've published everything in my "pet project" GitHub repo, already set up, so you only have to clone it and install the dependencies with yarn.

Main improvements I have made:

  • Implement an ArrayView-like structure that avoids mutating the Token[] while parsing (see the short sketch after this list)
  • Hand-roll some lexer logic instead of using RegEx
  • Implement most suggestions from the previous review
  • Many smaller improvements to spec-compliance and memory footprint
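
To make the first bullet concrete, here is a minimal sketch of the ArrayView idea; the real TokenList implementation is further down in util/TokenList.ts. Instead of consuming tokens with Array.prototype.shift(), which re-indexes the remaining elements on every call, the list keeps a read position and only advances it.

// Illustrative only - a simplified version of the TokenList shown below.
class TokenView<T> {
  private position = 0;

  constructor(private readonly items: T[]) {}

  get length(): number {
    // Only the not-yet-consumed tail counts.
    return this.items.length - this.position;
  }

  next(): T | undefined {
    // O(1) "consume": advance the read position instead of mutating the array.
    return this.items[this.position++];
  }
}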

The focus of this code-base:

  • Correctness (Spec-compliance)
  • Performance
  • User-Experience (Good error messages, Ease of use)
  • Code Quality

Questions

  • How could I further improve the performance?
  • Are there any alternative algorithms I could use?
  • How readable is the code?
  • Are there any best-practices I could implement?

This is a fair amount of code, but there is no need to review all of it.
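
For reviewers who want to see the public API before digging in, here is a small usage sketch based on Json.ts below. The class and method names come straight from the code; the import path and the sample input are made up for illustration.

import { Json } from "./Json"; // path assumed; adjust to the repo layout

// One-shot: string in, plain JS value out.
const value = Json.deserialize('{"greeting": "hi", "items": [1, 2, 3]}');
// value is { greeting: "hi", items: [1, 2, 3] }

// Or stop at the intermediate stages:
const tokens = Json.tokenize("[true, null]"); // TokenList of lexer tokens
const tree = Json.parse("[true, null]");      // Node tree; tree.toJsValue() yields [true, null]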

Code

Json.ts

import { Node } from "./lib/nodes/Node"; import { parse } from "./lib/parser"; import { tokenize } from "./lib/lexer"; import { JsonValue } from "./lib/types"; import { TokenList } from "./lib/util/TokenList"; export class Json { static deserialize(source: string): JsonValue { const root = this.parse(source); return root.toJsValue(); } static tokenize(source: string): TokenList { return tokenize(source); } static parse(source: string): Node { return parse(tokenize(source)); } } 

lexer.ts

import { TokenList } from "./util/TokenList";
import { PUNCTUATION_TOKENS, Token } from "./Token";
import { JsonLexerError } from "./util/JsonLexerError";

type Tokenizer = (source: string, cursor: number) => TokenizerResult;
type TokenizerResult =
  | { matched: false; token: null; cursor: null }
  | { matched: true; token: Token; cursor: number };

const NO_MATCH = { matched: false, token: null, cursor: null } as const;

export function tokenize(source: string): TokenList {
  let cursor = 0;
  const tokens: Token[] = [];
  const tokenizers: Tokenizer[] = [
    tokenizePunctuation,
    tokenizeNull,
    tokenizeNumber,
    tokenizeString,
    tokenizeBoolean,
  ];

  tokenLoop: while (cursor < source.length) {
    // Skip whitespace tokens
    if (
      source[cursor] === " " ||
      source[cursor] === "\n" ||
      source[cursor] === "\t" ||
      source[cursor] === "\r" ||
      source[cursor] === "\v"
    ) {
      cursor++;
      continue tokenLoop;
    }

    let didMatch = false;
    for (const tokenizer of tokenizers) {
      const result = tokenizer(source, cursor);
      if (result === NO_MATCH) {
        continue;
      }
      didMatch = true;
      cursor = result.cursor!;
      tokens.push(result.token!);
      break;
    }

    if (!didMatch) {
      throw new JsonLexerError("Unknown token", source, cursor);
    }
  }

  return new TokenList(tokens);
}

const NULL_TOKEN = new Token("null", "null");

function tokenizeNull(source: string, cursor: number): TokenizerResult {
  return matchStaticToken(source, cursor, NULL_TOKEN);
}

const TRUE_TOKEN = new Token("boolean", "true");
const FALSE_TOKEN = new Token("boolean", "false");
const BOOLEAN_TOKENS = [TRUE_TOKEN, FALSE_TOKEN];

function tokenizeBoolean(source: string, cursor: number): TokenizerResult {
  for (const token of BOOLEAN_TOKENS) {
    const result = matchStaticToken(source, cursor, token);
    if (result.matched) {
      return result;
    }
  }
  return NO_MATCH;
}

function tokenizeNumber(source: string, cursor: number): TokenizerResult {
  let internalCursor = cursor;

  function isDigit(index: number) {
    const value = source[index];
    return value >= "0" && value <= "9";
  }

  function skipDigits() {
    let hasLeadingZero = false;
    let totalDigitsMatched = 0;

    if (source[internalCursor] === "0") {
      hasLeadingZero = true;
    }

    while (internalCursor < source.length) {
      if (isDigit(internalCursor)) {
        internalCursor++;
        totalDigitsMatched++;
      } else {
        break;
      }
    }

    if (totalDigitsMatched === 0) {
      throw new JsonLexerError(
        "Invalid number! (Expected at least one digit at this location!)",
        source,
        internalCursor,
      );
    }

    return { hasLeadingZero, totalDigitsMatched };
  }

  // Parse optional sign of integer part
  if (source[internalCursor] === "-") {
    internalCursor++;
    // Throw on lonely minus
    if (!isDigit(internalCursor)) {
      throw new JsonLexerError("Invalid lonely minus sign", source, cursor);
    }
  }

  // Not a number if current cursor is not a digit - Return early
  if (!isDigit(internalCursor)) {
    return NO_MATCH;
  }

  // Parse the integer part
  const integerPart = skipDigits();
  if (integerPart!.hasLeadingZero && integerPart!.totalDigitsMatched > 1) {
    throw new JsonLexerError(
      "Invalid number! (No leading zeroes on fractions allowed!)",
      source,
      cursor,
    );
  }

  // Parse optional fraction part
  if (source[internalCursor] === ".") {
    internalCursor++;
    skipDigits();
  }

  // Parse optional exponent part
  if (source[internalCursor] === "e" || source[internalCursor] === "E") {
    internalCursor++;
    // Parse sign of exponent
    if (source[internalCursor] === "+" || source[internalCursor] === "-") {
      internalCursor++;
    }
    // Parse exponent number
    skipDigits();
  }

  return {
    matched: true,
    cursor: internalCursor,
    token: new Token("number", source.substring(cursor, internalCursor)),
  };
}

function tokenizeString(source: string, cursor: number): TokenizerResult {
  // Early return if the current char is not a " it cannot be a string therefore exit early
  if (source[cursor] !== '"') {
    return NO_MATCH;
  }

  let nextQuoteIndex = source.indexOf('"', cursor + 1);

  // While there are more characters to process collect them into the value variable.
  while (nextQuoteIndex !== -1 && nextQuoteIndex + 1 <= source.length) {
    // The second unescaped " closes the string
    if (source[nextQuoteIndex - 1] !== "\\") {
      const content = source.substring(cursor + 1, nextQuoteIndex);
      // A json string cannot contain a new-line
      if (content.indexOf("\n") !== -1) {
        return NO_MATCH;
      }
      return {
        matched: true,
        cursor: nextQuoteIndex + 1,
        token: new Token("string", content),
      };
    }
    nextQuoteIndex = source.indexOf('"', nextQuoteIndex + 1);
  }

  return NO_MATCH;
}

const PUNCTUATION_TOKEN_VALUES = Object.values(PUNCTUATION_TOKENS);

function tokenizePunctuation(source: string, cursor: number): TokenizerResult {
  for (const token of PUNCTUATION_TOKEN_VALUES) {
    const matchResult = matchLiteral(source, cursor, token.value);
    if (matchResult.matched) {
      return {
        matched: true,
        token,
        cursor: matchResult.cursor,
      };
    }
  }
  return NO_MATCH;
}

/*
 * Helper Functions
 */

function matchLiteral(
  source: string,
  cursor: number,
  token: string,
): { matched: false } | { matched: true; cursor: number } {
  if (source.substring(cursor, cursor + token.length) === token) {
    return {
      matched: true,
      cursor: cursor + token.length,
    };
  }
  return NO_MATCH;
}

function matchStaticToken(source: string, cursor: number, token: Token): TokenizerResult {
  const lookahead = matchLiteral(source, cursor, token.value);
  if (lookahead.matched) {
    return {
      matched: true,
      cursor: lookahead.cursor,
      token,
    };
  }
  return NO_MATCH;
}

parser.ts

import { TokenList } from "./util/TokenList";
import { JsonParserError } from "./util/JsonParserError";
import {
  Node,
  ArrayNode,
  ObjectNode,
  AnyScalarNode,
  NullScalarNode,
  NumberScalarNode,
  StringScalarNode,
  BooleanScalarNode,
} from "./nodes";

export function parse(tokens: TokenList): Node {
  const rootNode = parseSingle(tokens);
  if (tokens.length > 0) {
    throw new JsonParserError("Unexpected tokens at the end of source", tokens);
  }
  return rootNode;
}

export function parseSingle(tokens: TokenList): Node {
  if (tokens.length === 0) {
    throw new JsonParserError("Unexpected end of source!", tokens);
  }

  const initialToken = tokens.get(0);

  if (initialToken.isScalar) {
    return parseScalar(tokens);
  }
  if (initialToken.isObjectOpen) {
    return parseObject(tokens);
  }
  if (initialToken.isArrayOpen) {
    return parseArray(tokens);
  }

  throw new JsonParserError(
    `Could not parse token of type '${initialToken.type}' at this location`,
    tokens,
  );
}

function parseScalar(tokens: TokenList): AnyScalarNode {
  const token = tokens.next();
  if (token == null) {
    throw new JsonParserError("Unexpected end of file!", tokens);
  }

  switch (token.type) {
    case "null":
      return new NullScalarNode();
    case "number":
      return NumberScalarNode.fromString(token.value);
    case "string":
      return new StringScalarNode(parseString(token.value, tokens));
    case "boolean":
      return BooleanScalarNode.fromString(token.value);
    default:
      throw new JsonParserError(
        `Could not parse scalar "${token.value}" (Unknown type "${token.type}")`,
        tokens,
      );
  }
}

function parseArray(tokens: TokenList): ArrayNode {
  const arrayNode = new ArrayNode();

  // Removes the opening "[" token.
  tokens.next();

  const firstToken = tokens.get(0);

  // Empty Array, exit early.
  if (firstToken && firstToken.isArrayClose) {
    tokens.next();
    return arrayNode;
  }

  while (tokens.length > 0) {
    arrayNode.addChild(parseSingle(tokens));

    // The next token is either a comma or it is the closing bracket.
    // In both cases the token needs to be removed. We just need to keep it around
    // to check if it is a comma.
    const nextToken = tokens.next();

    // If the next token "after" the value is not a comma, we do not expect
    // any more values. Technically we don't even need the comma, but we stick
    // to the standard strictly.
    if (nextToken && nextToken.isComma) {
      continue;
    }
    if (nextToken && nextToken.isArrayClose) {
      return arrayNode;
    }
    throw new JsonParserError("Additional comma at end of array entries", tokens);
  }

  throw new JsonParserError(
    "Unexpected token at the end of an array entry, most likely a missing comma",
    tokens,
  );
}

function parseObject(tokens: TokenList) {
  const objectNode = new ObjectNode();

  tokens.next();

  const firstToken = tokens.get(0);

  // Empty Object, exit early
  if (firstToken && firstToken.isObjectClose) {
    tokens.next();
    return objectNode;
  }

  while (tokens.length > 0) {
    objectNode.addEntry(parseObjectEntry(tokens));

    const nextToken = tokens.next();

    // If there is a comma, the json spec specifies that there *must*
    // be another entry on the object
    if (nextToken && nextToken.isComma) {
      continue;
    }

    // If the next token is not a comma, there are no more entries
    // which means that the next token *must* be a "}"
    if (nextToken && nextToken.isObjectClose) {
      return objectNode;
    }

    throw new JsonParserError(
      "Unexpected token at the end of an object entry, most likely a missing comma",
      tokens,
    );
  }

  throw new JsonParserError("Unexpected end of source, while parsing object", tokens);
}

function parseObjectEntry(tokens: TokenList) {
  const keyToken = tokens.next();
  const separatorToken = tokens.next();

  if (!keyToken || !keyToken.isString) {
    throw new JsonParserError(
      `Unexpected token of type "${keyToken.type}" ("${keyToken.value}") used as object key`,
      tokens,
    );
  }

  if (!separatorToken || !separatorToken.isColon) {
    throw new JsonParserError(
      `Unexpected token of type "${separatorToken.type}" ("${separatorToken.value}") used as object key-value separator`,
      tokens,
    );
  }

  return {
    key: parseString(keyToken.value, tokens),
    value: parseSingle(tokens),
  };
}

function parseString(value: string, tokens: TokenList) {
  // Short-circuit optimization - If the string contains no backslash
  // then it cannot include escape sequences that would need to be parsed.
  if (value.indexOf("\\") === -1 && value.indexOf(" ") === -1) {
    return value;
  }

  const chars = value.split("");

  for (let index = 0; index < chars.length; index++) {
    const element = chars[index];

    if (
      element === "\t" ||
      element === "\n" ||
      element === "\b" ||
      element === "\f" ||
      element === "\r"
    ) {
      throw new JsonParserError(
        "Invalid characters in string. Control characters must be escaped!",
        tokens,
      );
    }

    // If this character is not the start of an escape sequence continue straight away...
    if (element !== "\\") {
      continue;
    }

    // Escape sequences have a minimum length of one. Example: \n
    if (chars.length <= index + 1) {
      throw new JsonParserError("Unexpected end of escape-sequence", tokens);
    }

    // This specifies which escape sequence is used.
    const escapeCharacter = chars[index + 1];

    // The backslash that initiates the escape sequence is always removed...
    chars[index] = "";
    index++;

    if (
      escapeCharacter === "\\" ||
      escapeCharacter === '"' ||
      escapeCharacter === "/" ||
      escapeCharacter === "b" ||
      escapeCharacter === "f" ||
      escapeCharacter === "n" ||
      escapeCharacter === "r" ||
      escapeCharacter === "t" ||
      escapeCharacter === '"'
    ) {
      const simpleEscapeSequences = {
        '"': '"',
        "\\": "\\",
        // Yeah the JSON spec does this... It is called the "solidus" escape sequence
        "/": "/",
        b: "\b",
        f: "\f",
        n: "\n",
        r: "\r",
        t: "\t",
      };

      chars[index] =
        simpleEscapeSequences[
          // Ugly type cast needed here because typescript does not narrow the type of the key because of the if above...
          escapeCharacter as keyof typeof simpleEscapeSequences
        ];
      continue;
    }

    // The escape sequence \u is special: it encodes a specific character by its unicode id
    // Example: "\u1234" means unicode codepoint with id 1234 which translates to a specific character
    if (escapeCharacter === "u") {
      if (chars.length < index + 5) {
        throw new JsonParserError("Unexpected end of escape-sequence", tokens);
      }

      const unicodeEscapeSequence = chars.slice(index + 1, index + 5).join("");

      // Check validity of escape sequence only hex characters are allowed.
      if (!/^[0-9A-Fa-f]{4}$/.test(unicodeEscapeSequence)) {
        throw new JsonParserError("Invalid unicode escape sequence", tokens);
      }

      // Construct the unicode character from its numbers
      chars[index] = String.fromCodePoint(Number(`0x${unicodeEscapeSequence}`));

      // Delete the original escape sequence
      chars[index + 1] = "";
      chars[index + 2] = "";
      chars[index + 3] = "";
      chars[index + 4] = "";
      index += 4;
      continue;
    }

    throw new JsonParserError("Unrecognized escape sequence", tokens);
  }

  return chars.join("");
}

Token.ts

type TokenType = "punctuation" | "boolean" | "string" | "number" | "null"; export class Token { public readonly type: TokenType; public readonly value: string; public constructor(type: TokenType, value: string) { this.type = type; this.value = value; } public get isString(): boolean { return this.type === "string"; } public get isNumber(): boolean { return this.type === "number"; } public get isBoolean(): boolean { return this.type === "boolean"; } public get isNull(): boolean { return this.type === "null"; } public get isScalar(): boolean { return this.isNull || this.isString || this.isNumber || this.isBoolean; } public get isPunctuation(): boolean { return this.type === "punctuation"; } public get isArrayOpen(): boolean { return isPredefinedPunctuation("arrayOpen", this); } public get isArrayClose(): boolean { return isPredefinedPunctuation("arrayClose", this); } public get isObjectOpen(): boolean { return isPredefinedPunctuation("objectOpen", this); } public get isObjectClose(): boolean { return isPredefinedPunctuation("objectClose", this); } public get isComma(): boolean { return isPredefinedPunctuation("comma", this); } public get isColon(): boolean { return isPredefinedPunctuation("colon", this); } } export const PUNCTUATION_TOKENS = { comma: new Token("punctuation", ","), colon: new Token("punctuation", ":"), arrayOpen: new Token("punctuation", "["), arrayClose: new Token("punctuation", "]"), objectOpen: new Token("punctuation", "{"), objectClose: new Token("punctuation", "}"), }; function isPredefinedPunctuation( key: keyof typeof PUNCTUATION_TOKENS, token: Token, ): boolean { return token.isPunctuation && PUNCTUATION_TOKENS[key].value === token.value; } 

types.ts

export type JsonScalar = boolean | number | string | null;
export type JsonObject = { [key: string]: JsonValue };
export type JsonArray = JsonValue[];
export type JsonValue = JsonObject | JsonArray | JsonScalar;

nodes/Node.ts

import { JsonValue } from "../types";

/**
 * Simple node base class.
 *
 * This class mainly exists to make typing a little easier.
 */
export abstract class Node {
  public abstract toJsValue(): JsonValue;
}

nodes/ScalarNode.ts

import { Node } from "./Node"; import { JsonValue } from "../types"; export abstract class ScalarNode<T extends JsonValue> extends Node { public readonly value: T; public constructor(value: T) { super(); this.value = value; } public toJsValue(): JsonValue { return this.value; } } export type AnyScalarNode = | NullScalarNode | StringScalarNode | NumberScalarNode | BooleanScalarNode; export class NullScalarNode extends ScalarNode<null> { constructor() { super(null); } } export class StringScalarNode extends ScalarNode<string> {} export class NumberScalarNode extends ScalarNode<number> { public static fromString(value: string): NumberScalarNode { return new NumberScalarNode(parseInt(value)); } } export class BooleanScalarNode extends ScalarNode<boolean> { public static fromString(value: string): BooleanScalarNode { if (value === "true") { return BOOL_NODE_TRUE; } if (value === "false") { return BOOL_NODE_FALSE; } throw new Error(`Invalid boolean value "${value}" could not be parsed`); } } const BOOL_NODE_TRUE = new BooleanScalarNode(true); const BOOL_NODE_FALSE = new BooleanScalarNode(false); 

nodes/ObjectNode.ts

import { Node } from "./Node"; import { JsonObject, JsonValue } from "../types"; export class ObjectNode extends Node { public readonly entries: Map<string, Node>; public constructor() { super(); this.entries = new Map(); } public addEntry(entry: { key: string; value: Node }): void { this.entries.set(entry.key, entry.value); } public toJsValue(): JsonValue { const result: JsonObject = {}; for (const [key, value] of this.entries) { result[key] = value.toJsValue(); } return result; } toJSON() { return this.toJsValue(); } } 

nodes/ArrayNode.ts

import { Node } from "./Node"; import { JsonArray, JsonValue } from "../types"; export class ArrayNode extends Node { public readonly children: Node[]; public constructor() { super(); this.children = []; } public addChild(value: Node): void { this.children.push(value); } public toJsValue(): JsonValue { const result: JsonArray = []; for (const value of this.children) { result.push(value.toJsValue()); } return result; } } 

util/JsonLexerError.ts

export class JsonLexerError extends Error {
  constructor(message: string, source: string, cursor: number) {
    super(
      `Failed to parse:\nReason: ${message}\nLocation: ${source.substring(cursor - 2, cursor + 100)}\n ^`,
    );
  }
}

util/JsonParserError.ts

import { TokenList } from "./TokenList";

export class JsonParserError extends Error {
  constructor(message: string, tokens: TokenList) {
    super(formatMessage(message, tokens));
  }
}

function serializeTokens(tokens: TokenList) {
  return tokens
    .asArray()
    .map(token => {
      if (token.isString) {
        return `"${token.value}"`;
      }
      return token.value;
    })
    .join("");
}

function formatMessage(message: string, tokens: TokenList): string {
  const source = serializeTokens(tokens);
  return `Could not parse JSON!\nReason: ${message}\nSource point: ${source}\n ^`;
}

util/TokenList.ts

import { Token } from "../Token"; // This is effectively an ArrayView implementation // It has some extra functionality that makes it // more ergonomic when using it in a parser context. export class TokenList { private tokens: Token[]; private position: number; constructor(tokens: Token[]) { this.tokens = tokens; this.position = 0; } get length() { return this.tokens.length - this.position; } get(index: number) { return this.tokens[this.position + index]; } next() { this.position++; return this.tokens[this.position - 1]; } asArray() { // There is no way to return just the elements "after" the current position // without mutating the array or making copies. Therefore we use this opportunity // to mutate the current array as a "cleanup" measure and return that. this.tokens = this.tokens.splice(0, this.position); this.position = 0; return this.tokens; } } 
  • My node23 on x64 macOS 404'd trying to yarn install. (And, unsurprisingly, building from source failed.) I think that profiler is only supported up to node22. tl;dr: I was unable to identify which lines are the hot spots in that workload. – J_H, Nov 8, 2024 at 17:40

1 Answer


This code looks great; I would want to maintain it.

I only found a few things, all in parseString:

  • In the olden days, testing indexOf against -1 with a bitwise NOT was faster, so value.indexOf("\\") === -1 becomes !~value.indexOf("\\"). You should test whether this is still true (a combined sketch follows this list).

  • You keep using ===, which compares type and value, in TypeScript, which already guarantees the type, so why use === instead of ==? I think this might slow down execution for no particularly good reason.

  • You want to go with const charLength = chars.length as a micro-optimization.

  • In JS, you can access string characters directly; there is no need to split the string. Later I see you change the characters, which makes that observation pointless. I still wonder, though: could it be more efficient not to split, and to build the result string as you go?
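
To make those suggestions concrete, here is a rough, hand-written sketch of how the !~indexOf idiom, a cached length, and direct character access could look in a parseString-style loop. None of this is from the repo, and the escape handling is deliberately reduced to a stub, so treat it as a starting point for benchmarking rather than a drop-in replacement.

// Bitwise-NOT membership test: ~(-1) === 0, so !~x is truthy exactly when indexOf returned -1.
function hasNoEscapes(value: string): boolean {
  return !~value.indexOf("\\");
}

function unescapeSketch(value: string): string {
  if (hasNoEscapes(value)) {
    return value; // same short-circuit idea as the original parseString
  }

  const length = value.length; // cache the length once instead of re-reading it each iteration
  let result = "";

  for (let index = 0; index < length; index++) {
    const char = value[index]; // index into the string directly, no split("")

    if (char !== "\\") {
      result += char; // build the output string as we go
      continue;
    }

    // Stub: a real implementation would map \n, \t, \uXXXX, ... here.
    // This sketch just copies the character that follows the backslash.
    index++;
    result += value[index] ?? "";
  }

  return result;
}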

