I've written a function that parses a fairly simple key-value pair syntax. Each pair can span multiple lines, as long as the value does not contain a colon. If it does, every continuation line must start with three spaces. For each pair, I create an object with the key, the value, and the offsets at which they appear (measured from the beginning of the string).
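To make the rule concrete, here is a minimal per-line sketch of it as described above. It is only an illustration of the syntax, not part of the parser shown later, and `startsNewPair` is a hypothetical name introduced for this example.

```typescript
// Hypothetical helper illustrating the rule described above:
// a line begins a new key-value pair when it contains a colon
// and does not start with three spaces.
function startsNewPair(line: string): boolean {
  const isForcedValue = line.slice(0, 3) === "   "; // three leading spaces: always part of a value
  return !isForcedValue && line.indexOf(":") !== -1;
}

const exampleLines = [
  "Key4: multi-line value which",
  "   continues here: has a colon",
  "   yet is still a value",
];

for (const line of exampleLines) {
  console.log(JSON.stringify(line), startsNewPair(line));
}
```

Only the first of the three example lines is treated as the start of a pair; the other two are continuation lines.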
You can get a better idea of this syntax from the following image (keys in blue, values in green):
I thought about using regex, but since I also need to keep track of the offsets for each item, and performance is extremely important, I thought it might be easier and more efficient to just use plain TypeScript. Here's the function I came up with:
    function parseTitlePageChunk(text: string): {key: string, value: string, keyoffset: number, valueoffset: number}[] {
        console.time("pairs");
        let pairs = [];
        let potentialValue = ""; //keep track of a string which may be a value
        let potentialKey = ""; //keep track of a string which may be a key
        let potentialKeyOffset = 0;
        let potentialValueOffset = 0;
        let colonInLine = false;
        let forceValue = false; //true if a line starts with three spaces
        let spaceCounter = -1; //if the spaceCounter==-1 there's no more spaces at the beginning of the line
        for (let i = 0; i < text.length; i++) {
            let c = text[i];
            if (c == ':' && !colonInLine && !forceValue) {
                //We ran into a colon, promote the potential key to an actual one
                pairs.push({key: potentialKey, value: "", keyoffset: potentialKeyOffset, valueoffset: potentialValueOffset});
                potentialValue = ""; //reset the potential value
                potentialValueOffset = i + 1; //reset the potential value offset
                colonInLine = true;
            } else if (c == '\n' && pairs.length > 0) {
                //we hit a new line, and there exists a previous key
                pairs[pairs.length - 1].value = potentialValue; //set the value of the previous key
                pairs[pairs.length - 1].valueoffset = potentialValueOffset;
                potentialValue += '\n';
                potentialKey = "\n";
                potentialKeyOffset = i;
                colonInLine = false;
                forceValue = false;
                spaceCounter = 0;
            } else {
                if (spaceCounter != -1 && c == ' ') {
                    spaceCounter++;
                } else {
                    spaceCounter = -1;
                }
                potentialValue += c;
                potentialKey += c;
                if (spaceCounter >= 3) forceValue = true;
            }
        }
        if (pairs.length > 0) {
            //assign the last potential value to the last key
            pairs[pairs.length - 1].value = potentialValue;
            pairs[pairs.length - 1].valueoffset = potentialValueOffset;
        }
        console.timeEnd("pairs");
        return pairs;
    }
Here's a sample input:
    Key: in-line value
    Key2: in-line value: with a colon
    Key3: multi-line value
    which continues here
    Key4: multi-line value which
       continues here: has a colon
       yet is still a value
which outputs the following (stringified to JSON):
    [{
        "key": "Key",
        "value": " in-line value",
        "keyoffset": 0,
        "valueoffset": 4
    }, {
        "key": "\nKey2",
        "value": " in-line value: with a colon",
        "keyoffset": 18,
        "valueoffset": 24
    }, {
        "key": "\nKey3",
        "value": " multi-line value\nwhich continues here",
        "keyoffset": 52,
        "valueoffset": 58
    }, {
        "key": "\nKey4",
        "value": " multi-line value which\n   continues here: has a colon\n   yet is still a value\n",
        "keyoffset": 96,
        "valueoffset": 102
    }]
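As a quick sanity check on the offsets in this output, every key and value should be found in the input at its recorded offset. Note that keys after the first include the leading `\n`, so their `keyoffset` points at that newline. The sketch below is a minimal way to verify this; `Pair` and `offsetsMatch` are hypothetical names introduced for the example, and the data is the first two pairs from the output above.

```typescript
// Shape of each parsed pair, matching the return type of parseTitlePageChunk.
interface Pair {
  key: string;
  value: string;
  keyoffset: number;
  valueoffset: number;
}

// Hypothetical helper: true when every key and value appears in `text`
// exactly at its recorded offset.
function offsetsMatch(text: string, pairs: Pair[]): boolean {
  for (const p of pairs) {
    const keyAtOffset = text.slice(p.keyoffset, p.keyoffset + p.key.length);
    const valueAtOffset = text.slice(p.valueoffset, p.valueoffset + p.value.length);
    if (keyAtOffset !== p.key || valueAtOffset !== p.value) {
      return false;
    }
  }
  return true;
}

// First two lines of the sample input and their pairs from the output above.
const sampleText = "Key: in-line value\nKey2: in-line value: with a colon\n";
const samplePairs: Pair[] = [
  { key: "Key", value: " in-line value", keyoffset: 0, valueoffset: 4 },
  { key: "\nKey2", value: " in-line value: with a colon", keyoffset: 18, valueoffset: 24 },
];

console.log(offsetsMatch(sampleText, samplePairs));
```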
It works pretty flawlessly as far as I can tell and seems quite efficient (it only iterates through the string once), but I also find it a little overcomplicated for what seems like a fairly simple task. However, I can't figure out how to simplify it any further. Thoughts?
key: xxx/
new line__yyy/
new line__zz
. Which seems a bit more safe. But that is a matter of taste.