I need to parse a stream of characters IEnumerable<char>
as a list of space (and newline) separated numbers.
Numbers can be expressed in base 10, base 8 (with 0
prefix) and base 16 (with 0x
prefix). They can be both positive and negative (also for hex numbers then, for example, -0xff
is a valid input), leading sign is optional.
FormatException
should be thrown for invalid numbers and OverflowException
should be thrown for out of Int32
range numbers. Parsing does not need to support thousands separator and input characters are always in the ASCII range.
Parsing has to be as fast as possible but without sacrificing readability too much.
Implementation (few more helper methods are omitted) is:
public sealed class Int32Parser { public static IEnumerable<int> Parse(string text) => new Int32Parser().Consume(text); public IEnumerable<int> Consume(IEnumerable<char> stream) { if (stream == null) throw new ArgumentNullException(nameof(stream)); _stream = stream.GetEnumerator(); _hasMoreCharactersToConsume = TryMoveToNextCharacter(); while (_hasMoreCharactersToConsume) { if (ParseNextNumber(out int number)) yield return number; } } private IEnumerator<char> _stream; private bool _hasMoreCharactersToConsume; private char CurrentCharacter => _stream.Current; private bool ParseNextNumber(out int number) { number = 0; if (!SkipLeadingWhiteSpaces()) return false; int sign = ParseOptionalSign(); int numericBase = ParseOptionalBasePrefix(); while (!IsWhiteSpaceOrNewLine(CurrentCharacter)) { int digit = ParseDigit(CurrentCharacter); if (digit >= numericBase) throw new FormatException($"Invalid digit {digit} for base {numericBase}."); checked { number = (number * numericBase) + digit; } if (!TryMoveToNextCharacter()) break; } number = sign * number; return true; } private bool SkipLeadingWhiteSpaces() { while (IsWhiteSpaceOrNewLine(CurrentCharacter)) { if (!TryMoveToNextCharacter()) return false; } return true; } private int ParseOptionalSign() { if (CurrentCharacter == '-') { MoveToNextCharacter(); return -1; } if (CurrentCharacter == '+') MoveToNextCharacter(); return 1; } private int ParseOptionalBasePrefix() { if (CurrentCharacter != '0') return 10; if (!TryMoveToNextCharacter()) return 10; if (CurrentCharacter != 'x' && CurrentCharacter != 'X') return 8; MoveToNextCharacter(); return 16; } private int ParseDigit(char c) { if (c >= '0' && c <= '9') return c - '0'; if (c >= 'A' && c <= 'F') return c - 'A' + 10; if (c >= 'a' && c <= 'f') return c - 'a' + 10; throw new FormatException($"Unknown digit {c}"); } private static bool IsWhiteSpaceOrNewLine(char c) => c == ' ' || c == '\t' || c == '\n' || c == '\r'; private bool TryMoveToNextCharacter() => _hasMoreCharactersToConsume = _stream.MoveNext(); private void MoveToNextCharacter() { if (!TryMoveToNextCharacter()) throw new FormatException("Expected more characters."); } }
A simple test:
foreach (var number in Int32Parser.Parse(" 2147483647 -1 -2 3 0xa 0xb 010 0 0x \n+1 +100 -33 \n")) Console.WriteLine(number);
Direct performance comparison isn't fair (for Int32.Parse()
) because requirements are pretty different (and the test measures also quality of my own implementation together with String.Split()
) but here a small quick-and-dirty performance comparison:
static class Program { static void Main(string[] args) { foreach (int numberOfDigits in Enumerable.Range(1, 9)) Compare(numberOfDigits); } private static void Compare(int numberOfDigits) { Console.WriteLine("Numbers with {0} digits", numberOfDigits); var testString = CreateTestString(numberOfDigits); Console.WriteLine(" Testing native 1"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseNetNative1)); Console.WriteLine(" Testing native 2"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseNetNative2)); Console.WriteLine(" Testing custom"); Console.WriteLine(" Total elapsed time: {0} ticks", Measure(testString, ParseCustom)); } private static string CreateTestString(int numberOfDigits) { int minValue = (int)Math.Pow(10, numberOfDigits) / 10; int maxValue = (int)Math.Pow(10, numberOfDigits + 1) - 1; var rnd = new Random(); return String.Join(" ", GenerateRandomNumbers()) + "\n" + String.Join(" ", GenerateRandomNumbers()); IEnumerable<int> GenerateRandomNumbers() => Enumerable.Range(1, 10000).Select(x => rnd.Next(minValue, minValue) * ((rnd.Next() & 1) == 1 ? 1 : -1)); } private static long Measure(string text, Func<string, int> method) { const int repetitions = 1000; int result = method(text); var stopwatch = new Stopwatch(); stopwatch.Start(); for (int i=0; i < repetitions; ++i) result = method(text); stopwatch.Stop(); return stopwatch.ElapsedTicks / repetitions; } private static int ParseNetNative1(string text) { return text.Split(new char[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries) .Select(x => Int32.Parse(x, NumberStyles.Number, CultureInfo.InvariantCulture)) .Count(); } private static int ParseNetNative2(string text) { var numbersToParse = text.Split(new char[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries); var numbers = new int[numbersToParse.Length]; for (int i = 0; i < numbers.Length; ++i) numbers[i] = Int32.Parse(numbersToParse[i], NumberStyles.Number, CultureInfo.InvariantCulture); return numbers.Length; } private static int ParseCustom(string text) { return Int32Parser.Parse(text).Count(); } }
Int32.Parse()
does not support negative hex numbers then they're excluded from this comparison but I suppose a similar figure. A quick plot follows; numbers are in ticks for a long string made of 20k numbers with N digit, roughly 50% of them are negative, first 10k separated with one space then one new line and then another 10 k separated with two spaces (see CreateTestString()
):
Parsing a string (without hex negative numbers) is not the only use-case but it's an important one, if TestNative1()
or TestNative2()
are obviously inefficient then I'm open to critiques also about them. Here a reference implementation as used in tests:
private const char[] Separators = new char[] { ' ', '\t', '\n', '\r' }; private static IEnumerable<int> ParseNetNative1(string text) { return text.Split(separators, StringSplitOptions.RemoveEmptyEntries) .Select(x => Int32.Parse(x, NumberStyles.Number, CultureInfo.InvariantCulture)); } private static IEnumerable<int> ParseNetNative2(string text) { var numbersToParse = text.Split(separators, StringSplitOptions.RemoveEmptyEntries); var numbers = new int[numbersToParse.Length]; for (int i = 0; i < numbers.Length; ++i) numbers[i] = Int32.Parse(numbersToParse[i], NumberStyles.Number, CultureInfo.InvariantCulture); return numbers; }
CreateTestString()
. It's a C# 7 thing, if you're using a previous version you need to move it outside and to refactor the out variable inInt32Parser.Consume()
\$\endgroup\$number = (number * numericBase) + digit;
is running in achecked
block, it will fail for numbers that are too big in absolute values (aboutint.MaxValue/10
orint.MinValue/10
). I hope your input will not have values like that.\$\endgroup\$