Parse a stream of space separated numbers

Question

I need to parse a stream of characters IEnumerable<char> as a list of space (and newline) separated numbers.

Numbers can be expressed in base 10, base 8 (with 0 prefix) and base 16 (with 0x prefix). They can be both positive and negative (also for hex numbers then, for example, -0xff is a valid input), leading sign is optional.

FormatException should be thrown for invalid numbers and OverflowException should be thrown for out of Int32 range numbers. Parsing does not need to support thousands separator and input characters are always in the ASCII range.

Parsing has to be as fast as possible but without sacrificing readability too much.

Implementation (few more helper methods are omitted) is:

public sealed class Int32Parser { public static IEnumerable<int> Parse(string text) => new Int32Parser().Consume(text); public IEnumerable<int> Consume(IEnumerable<char> stream) { if (stream == null) throw new ArgumentNullException(nameof(stream)); _stream = stream.GetEnumerator(); _hasMoreCharactersToConsume = TryMoveToNextCharacter(); while (_hasMoreCharactersToConsume) { if (ParseNextNumber(out int number)) yield return number; } } private IEnumerator<char> _stream; private bool _hasMoreCharactersToConsume; private char CurrentCharacter => _stream.Current; private bool ParseNextNumber(out int number) { number = 0; if (!SkipLeadingWhiteSpaces()) return false; int sign = ParseOptionalSign(); int numericBase = ParseOptionalBasePrefix(); while (!IsWhiteSpaceOrNewLine(CurrentCharacter)) { int digit = ParseDigit(CurrentCharacter); if (digit >= numericBase) throw new FormatException($"Invalid digit {digit} for base {numericBase}."); checked { number = (number * numericBase) + digit; } if (!TryMoveToNextCharacter()) break; } number = sign * number; return true; } private bool SkipLeadingWhiteSpaces() { while (IsWhiteSpaceOrNewLine(CurrentCharacter)) { if (!TryMoveToNextCharacter()) return false; } return true; } private int ParseOptionalSign() { if (CurrentCharacter == '-') { MoveToNextCharacter(); return -1; } if (CurrentCharacter == '+') MoveToNextCharacter(); return 1; } private int ParseOptionalBasePrefix() { if (CurrentCharacter != '0') return 10; if (!TryMoveToNextCharacter()) return 10; if (CurrentCharacter != 'x' && CurrentCharacter != 'X') return 8; MoveToNextCharacter(); return 16; } private int ParseDigit(char c) { if (c >= '0' && c <= '9') return c - '0'; if (c >= 'A' && c <= 'F') return c - 'A' + 10; if (c >= 'a' && c <= 'f') return c - 'a' + 10; throw new FormatException($"Unknown digit {c}"); } private static bool IsWhiteSpaceOrNewLine(char c) => c == ' ' || c == '\t' || c == '\n' || c == '\r'; private bool TryMoveToNextCharacter() => _hasMoreCharactersToConsume = _stream.MoveNext(); private void MoveToNextCharacter() { if (!TryMoveToNextCharacter()) throw new FormatException("Expected more characters."); } }

A simple test:

foreach (var number in Int32Parser.Parse(" 2147483647 -1 -2 3 0xa 0xb 010 0 0x \n+1 +100 -33 \n")) Console.WriteLine(number);

Direct performance comparison isn't fair (for Int32.Parse()) because requirements are pretty different (and the test measures also quality of my own implementation together with String.Split()) but here a small quick-and-dirty performance comparison:

static class Program { static void Main(string[] args) { foreach (int numberOfDigits in Enumerable.Range(1, 9)) Compare(numberOfDigits); } private static void Compare(int numberOfDigits) { Console.WriteLine("Numbers with {0} digits", numberOfDigits); var testString = CreateTestString(numberOfDigits); Console.WriteLine(" Testing native 1"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseNetNative1)); Console.WriteLine(" Testing native 2"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseNetNative2)); Console.WriteLine(" Testing custom"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseCustom)); } private static string CreateTestString(int numberOfDigits) { int minValue = (int)Math.Pow(10, numberOfDigits) / 10; int maxValue = (int)Math.Pow(10, numberOfDigits + 1) - 1; var rnd = new Random(); return String.Join(" ", GenerateRandomNumbers()) + "\n" + String.Join(" ", GenerateRandomNumbers()); IEnumerable<int> GenerateRandomNumbers() => Enumerable.Range(1, 10000).Select(x => rnd.Next(minValue, minValue) * ((rnd.Next() & 1) == 1 ? 1 : -1)); } private static long Measure(string text, Func<string, int> method) { const int repetitions = 1000; int result = method(text); var stopwatch = new Stopwatch(); stopwatch.Start(); for (int i=0; i < repetitions; ++i) result = method(text); stopwatch.Stop(); return stopwatch.ElapsedTicks / repetitions; } private static int ParseNetNative1(string text) { return text.Split(new char[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries) .Select(x => Int32.Parse(x, NumberStyles.Number, CultureInfo.InvariantCulture)) .Count(); } private static int ParseNetNative2(string text) { var numbersToParse = text.Split(new char[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries); var numbers = new int[numbersToParse.Length]; for (int i = 0; i < numbers.Length; ++i) numbers[i] = Int32.Parse(numbersToParse[i], NumberStyles.Number, CultureInfo.InvariantCulture); return numbers.Length; } private static int ParseCustom(string text) { return Int32Parser.Parse(text).Count(); } }

Int32.Parse() does not support negative hex numbers then they're excluded from this comparison but I suppose a similar figure. A quick plot follows; numbers are in ticks for a long string made of 20k numbers with N digit, roughly 50% of them are negative, first 10k separated with one space then one new line and then another 10 k separated with two spaces (see CreateTestString()):

Parsing a string (without hex negative numbers) is not the only use-case but it's an important one, if TestNative1() or TestNative2() are obviously inefficient then I'm open to critiques also about them. Here a reference implementation as used in tests:

private const char[] Separators = new char[] { ' ', '\t', '\n', '\r' }; private static IEnumerable<int> ParseNetNative1(string text) { return text.Split(separators, StringSplitOptions.RemoveEmptyEntries) .Select(x => Int32.Parse(x, NumberStyles.Number, CultureInfo.InvariantCulture)); } private static IEnumerable<int> ParseNetNative2(string text) { var numbersToParse = text.Split(separators, StringSplitOptions.RemoveEmptyEntries); var numbers = new int[numbersToParse.Length]; for (int i = 0; i < numbers.Length; ++i) numbers[i] = Int32.Parse(numbersToParse[i], NumberStyles.Number, CultureInfo.InvariantCulture); return numbers; }

Looks like you are missing GenerateRandomNumbers method for me to run your code. — CharlesNRice, CommentedOct 6, 2017 at 13:58
@CharlesNRice it's a local function of CreateTestString(). It's a C# 7 thing, if you're using a previous version you need to move it outside and to refactor the out variable in Int32Parser.Consume() — Adriano Repetti, CommentedOct 6, 2017 at 14:03
Since number = (number * numericBase) + digit; is running in a checked block, it will fail for numbers that are too big in absolute values (about int.MaxValue/10 or int.MinValue/10). I hope your input will not have values like that. — Igor Soloydenko, CommentedOct 6, 2017 at 21:18
Yes, it's checked to throw OverflowException for big numbers (outside Int32 range) but parsing one digit from left to right it works smoothly for the others (it's the first test in the small snippet). — Adriano Repetti, CommentedOct 6, 2017 at 22:10

t3chb0t · Accepted Answer · 2017-10-07 11:26:46Z

The algorithm seems to be fine and all members have nice and clear names so the code documents itself pretty well. However as a whole I don't like implementation. I find that the parser is a little bit messy and chaotic. I especially don't like these two fields that are modified all over the place by various methods and make the parser thread-unsafe.

private IEnumerator<char> _stream; private bool _hasMoreCharactersToConsume;

If I had one instance of the parser and called Consume in two different threads weird things would happen. You try to hide it by instantiating it via the static method Parse but the default constructor isn't private so theoretically it's possible to use it incorrectly.

I suggest the following changes:

It would be better to have only one instance method Parse and make the other ones static and pass the IEnumerator<char> and other required parameters to them so that they can advance it themselfes if necessary and be almost pure methods with the exception of advancing the enumerator.
It would be cleaner to call all methods TrySomething and consequently return a bool and the result via an out parameter because each of them can reach the end.
An enumerator is disposable so a using should be added.
I like the new switch with filters so I think it would be cleaner to use it nearly everywhere here.
If you want to add an index to the error message where the invalid charachter was found I would create a custom enumerator that provides an additional property like CurrentIndex that could be implemented as a private class of the parser.

Example:

public class Int32Parser { public IEnumerable<int> Parse(string text) { using (var enumerator = text.GetEnumerator()) { while (enumerator.MoveNext()) { if (IsWhiteSpaceCharacter(enumerator.Current)) { continue; } if (TryReadNumber(enumerator, out var number)) { yield return number; } else { yield break; } } } } private static bool TryReadNumber(IEnumerator<char> enumerator, out int number) { number = 0; if (!TryReadSign(enumerator, out var sign)) return false; if (!TryReadBasePrefix(enumerator, out var basePrefix)) return false; while (TryReadDigit(enumerator, out var digit)) { checked { number = (number * basePrefix) + digit; } if (!enumerator.MoveNext()) return false; } number = sign * number; return true; } private static bool TryReadSign(IEnumerator<char> enumerator, out int sign) { switch (enumerator.Current) { case '-': sign = -1; return enumerator.MoveNext(); case '+': sign = 1; return enumerator.MoveNext(); default: sign = 1; // assume its' positive return true; } } private static bool TryReadBasePrefix(IEnumerator<char> enumerator, out int basePrefix) { basePrefix = -1; switch (enumerator.Current) { case '0': if (!enumerator.MoveNext()) return false; switch (enumerator.Current) { case 'X': case 'x': basePrefix = 16; return enumerator.MoveNext(); case char c when '0' <= c && c <= '7': basePrefix = 8; return true; // don't move, it's a digit default: basePrefix = 10; // assume it's decimal return true; // don't move, it's a digit } default: basePrefix = 10; // assume it's decimal return true; // don't move, it's a digit } } private static bool TryReadDigit(IEnumerator<char> enumerator, out int digit) { switch (enumerator.Current) { case char c when (c >= '0' && c <= '9'): digit = c - '0'; return true; case char c when (c >= 'A' && c <= 'F'): digit = c - 'A' + 10; return true; case char c when (c >= 'a' && c <= 'f'): digit = c - 'a' + 10; return true; // we're done reading digits case char c when IsWhiteSpaceCharacter(enumerator.Current): digit = -1; return false; default: throw new FormatException($"Unknown character '{enumerator.Current}'."); } } private static bool IsWhiteSpaceCharacter(char c) { switch (c) { case ' ': case '\t': case '\r': case '\n': return true; default: return false; } } }

In my tests it was slightly faster then then original implementation although I'm not sure where the performance improvement actually comes from.

The slight performance gain is probably happening because everything is a static method. I find your switch statements more readable; but I'm not a big fan of continue in any code unless there is just no other way. — Svek, CommentedOct 7, 2017 at 17:44
@Svek right, an else would be better here. I'll fix that later as I'm not a big fan of continues either. — t3chb0t, CommentedOct 7, 2017 at 21:45
I'm not fully convinced about the enumerator parameter, I have a class then I'm usually happy to use instance fields to share variables [exactly because] used by all methods. About everything else...I love this implementation, it's much more easy to understand and to follow. Especially TryXyz() made code MUCH better than my original one (those MoveToNextCharacter() + TryMoveToNextCharacter() thrown all around really bothered me...) — Adriano Repetti, CommentedOct 9, 2017 at 7:29

Igor Soloydenko · Accepted Answer · 2017-10-06 22:16:22Z

You will not like my answer. TL;DR: I could not find a specific point where a code change would lead to performance improvement. I have, however, noticed some correctness issues.

Get "Enough" Test Samples

Below is the "test bench" I used to compile your solution before feeding it to a profiler. If you uncomment the console output instruction in the catch block, you'll be able to see some unhandled exception which could be caused by either bad input my code generated or a bug in the parser code.

public static class Program { public static void Main() { var errCount = 0; foreach (var line in Generate()) { try { foreach (var number in Int32Parser.Parse(line)) { //Console.WriteLine(line + " ----> " + number); } } catch (Exception) { errCount++; //Console.WriteLine(line); } } //Console.WriteLine(errCount); } static Random rnd = new Random(0); private static IEnumerable<string> Generate(int totalCount = 1024) { var result = new List<string>(); for (var counter = 1; counter <= totalCount; counter++) { var items = GenerateItems(); result.Add( string.Join(" ", items) ); } return result; } private static IEnumerable<string> GenerateItems(int totalItemCount = 1024) { var items = new List<string> { }; for (var counter = 1; counter <= totalItemCount; counter++) items.Add(Generators[rnd.Next(0, Generators.Length)]()); return items.OrderBy(_ => rnd.Next(0, items.Count)); } static int MaxValue = int.MaxValue / 11; static int MinValue = int.MinValue / 11; static Func<string>[] Generators = new Func<string>[] { () => rnd.Next(0, MaxValue).ToString(), () => "+" + rnd.Next(0, MaxValue).ToString(), () => rnd.Next(MinValue, -1).ToString(), () => 0.ToString(), () => "0x" + rnd.Next(0, int.MaxValue / 11).ToString("x"), () => "0" + Convert.ToString(rnd.Next(0, MaxValue), 8), () => "\n", () => "\t", () => "\r", () => " ", }; }

Performance Improvement should be Data Driven

After running profiler, I got a chart to work with. Mapping the invocation stats to the instructions in the source code did not give me many ideas. What's worse, the few ideas which popped out, are pretty crazy.

Please notice that this is generated by compiling into .NET Core dll. The .NET FW Classic results may be different.

I may be very wrong, but maybe, just maybe, materializing the IEnumerable<char> into a char[] and dealing with current and possible next character via indices is a faster solution. I really doubt that using the string's IEnumerator<char>::Current or IEnumerator<char>::MoveNext() is slow, but what if it is? With char array we know for sure that element-by-index access costs \$O(1)\$. Same is true with reach-line-end check -- \$O(1)\$ if implemented based on a known current index and total string length.

If your real code works with a string, I don't see how char[] would be a bad thing to do if it improves the performance. However, if the code is indeed a template for working with real char-producing streams/IEnumerables, it may lead to memory consumption problems.

Not an answer, really

The reason, I think this "optimization" will not work, is because we're talking about string and IEnumerable<T>. I don't believe we can possibly beat the implementation(s) of these fundamental pieces of .NET platform.

P.S.

As mentioned, the code looks pretty good to me. No obvious flaws are seen. But this is a highly algorithmic code in nature so to speak. Only an excessive set of unit- and integration tests will help getting more confidence with correctness of the solution. I'll just assume that the tests are in place and all green.

Hopefully, other reviewers will be able to provide more valuable feedback.

It's not a bad idea, here I was trying to fit few different use cases (input as string, StringBuilder and IEnumerable<char> from a blocking network stream) then I went with the maximum common divisorIEnumerable<char> but to squeeze better performance (I still need to measure in a real world usage scenario) I might have to go with specialized versions (unfortunately string and StringBuilder don't share any interface with indexer access) — Adriano Repetti, CommentedOct 9, 2017 at 7:33

Svek · Accepted Answer · 2017-10-07 17:39:53Z

You are essentially benchmarking against Int.Parse() --- which tends to have slower performance than hand-coded versions for specific purposes.

Here is an example that shows you that

private static _delimters = new char[] { ' ', '\t', '\n', '\r' }; // this method is exactly the same as your ParseNetNative2() method // except we are NOT using the Int32.Parse, instead we use the alternative one private static int ParseCustom2(string text) { var numbersToParse = text.Split(_delimters, StringSplitOptions.RemoveEmptyEntries); var numbers = new int[numbersToParse.Length]; for (int i = 0; i < numbers.Length; ++i) numbers[i] = CustomInt32Parse(numbersToParse[i]); // <-- not using Int32.Parse return numbers.Length; } //alternative method to Int32.Parse //credit: http://cc.davelozinski.com/c-sharp/fastest-way-to-convert-a-string-to-an-int private static int CustomInt32Parse(string s) { int result = 0; for (int i = 0; i < s.Length; i++) result = result * 10 + (s[i] - '0'); return result; }

With this approach I was getting generally better results than your current implementation.

// custom = Your ParseCustom() method // custom 2 = My ParseCustom2() method **above** Numbers with 1 digits Testing custom || Total elapsed time: 3791 ticks Testing custom 2 || Total elapsed time: 3611 ticks //<-- better Numbers with 2 digits Testing custom || Total elapsed time: 4455 ticks Testing custom 2 || Total elapsed time: 3792 ticks //<-- better (always) Numbers with 3 digits Testing custom || Total elapsed time: 5258 ticks Testing custom 2 || Total elapsed time: 4235 ticks //<-- better Numbers with 4 digits Testing custom || Total elapsed time: 6191 ticks Testing custom 2 || Total elapsed time: 6549 ticks //<-- ? Numbers with 5 digits Testing custom || Total elapsed time: 6927 ticks Testing custom 2 || Total elapsed time: 4762 ticks //<-- better (always) Numbers with 6 digits Testing custom || Total elapsed time: 7725 ticks Testing custom 2 || Total elapsed time: 7689 ticks //<-- better Numbers with 7 digits Testing custom || Total elapsed time: 8660 ticks Testing custom 2 || Total elapsed time: 6062 ticks //<-- better (always) Numbers with 8 digits Testing custom || Total elapsed time: 10111 ticks Testing custom 2 || Total elapsed time: 8362 ticks //<-- better (always) Numbers with 9 digits Testing custom || Total elapsed time: 11399 ticks Testing custom 2 || Total elapsed time: 8615 ticks //<-- better (always)

I think the only merit with my suggested solution is easier readability in terms of code to maintain. It has consistently beat your approach especially with 2, 5, 7, 8 and 9 digits.

However, according to the author of the above alternative to the Int32.Parse() there are some constraints to be aware of...
The alternative method assumes all numbers:
are positive
do not have thousandths separators or decimal points
contain no scientific or other special notations
and all conversions will happen with no exception testing because I’m just wanting to test the raw speed of converting.

Something very unusual, but your algorithm seems to always outperform mine, when dealing with 4 digits?

Stack Exchange Network

Parse a stream of space separated numbers

3 Answers 3

Get "Enough" Test Samples

Performance Improvement should be Data Driven

Not an answer, really

P.S.

Hot Network Questions

Parse a stream of space separated numbers

3 Answers 3

Get "Enough" Test Samples

Performance Improvement should be Data Driven

Not an answer, really

P.S.

Related

Hot Network Questions