Lazy String.Split

Question

C#'s String.Split method dates back to .NET 2.0 or earlier, when lazy sequence operations weren't yet commonplace. The task is to split a string on a (single) separator. With String.Split, that looks like

    string[] split = myString.Split(new string[] { separator }, StringSplitOptions.None);

That's not that bad, but if you want to apply more operations to that string[] (and you probably do), you'll need to loop over the whole array, basically iterating the string twice. Using the coroutine-like behaviour of the lazy yield keyword, you can (maybe) do more than one operation while iterating over the string only once.

    public static IEnumerable<string> LazySplit(this string stringToSplit, string separator)
    {
        if (stringToSplit == null) throw new ArgumentNullException("stringToSplit");
        if (separator == null) throw new ArgumentNullException("separator");

        var lastIndex = 0;
        var index = -1;
        do
        {
            index = stringToSplit.IndexOf(separator, lastIndex);
            if (index < 0 && lastIndex != stringToSplit.Length)
            {
                yield return stringToSplit.Substring(lastIndex);
                yield break;
            }
            else if (index >= lastIndex)
            {
                yield return stringToSplit.Substring(lastIndex, index - lastIndex);
            }
            lastIndex = index + separator.Length;
        } while (index > 0);
    }

While this does not have the "remove empty entries" option, using myString.LazySplit(separator).Where(str => !String.IsNullOrWhiteSpace(str)) should do the job with an O(n) operation, or am I wrong here?
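One caveat on the filter: String.IsNullOrWhiteSpace is stricter than StringSplitOptions.RemoveEmptyEntries, which drops only zero-length entries. Here is a small sketch that shows the difference using the built-in Split; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

public static class EmptyEntryDemo
{
    // Drops only zero-length entries, as StringSplitOptions.RemoveEmptyEntries does.
    public static string[] RemoveEmpty(string value, string separator)
    {
        return value.Split(new[] { separator }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Drops empty AND whitespace-only entries, like the Where filter in the question.
    public static string[] RemoveBlank(string value, string separator)
    {
        return value.Split(new[] { separator }, StringSplitOptions.None)
                    .Where(s => !string.IsNullOrWhiteSpace(s))
                    .ToArray();
    }
}
```

For the input "a; ;b;;", RemoveEmpty keeps the " " entry while RemoveBlank drops it. Filtering with s => s.Length != 0 instead would mirror RemoveEmptyEntries exactly.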

I'm not sure about the time complexity when using coroutines, but for the functionality I've written some unit tests to be sure it's working:

    [TestMethod]
    public void LazyStringSplit()
    {
        var str = "ab;cd;;";
        var resp = str.LazySplit(";");
        var expected = new[] { "ab", "cd", "" };
        var result = resp.ToArray();
        CollectionAssert.AreEqual(expected, result);
    }

    [TestMethod]
    public void LazyStringSplitEmptyString()
    {
        var str = "";
        var resp = str.LazySplit(";");
        var expected = new string[0];
        var result = resp.ToArray();
        CollectionAssert.AreEqual(expected, result);
    }

    [TestMethod]
    public void LazyStringSplitWithoutEmpty()
    {
        var str = "ab;cd;;";
        var resp = str.LazySplit(";").Where(s => !string.IsNullOrWhiteSpace(s));
        var expected = new[] { "ab", "cd" };
        var result = resp.ToArray();
        CollectionAssert.AreEqual(expected, result);
    }

    [TestMethod]
    public void LazyStringSplitNoSplit()
    {
        var str = "ab;cd;;";
        var resp = str.LazySplit(" ");
        var expected = new[] { "ab;cd;;" };
        var result = resp.ToArray();
        CollectionAssert.AreEqual(expected, result);
    }

I don't think this works the way you want it to, ";abc".LazySplit(";") returns an empty sequence. — mjolka, Commented Mar 15, 2015 at 23:12
@mjolka yeah, you're right, should've written better unit tests. — Mephy, Commented Mar 15, 2015 at 23:15
In which case, it isn't working yet and is off-topic till it is, sadly. You need to get it to function as intended first. Fixing that may even lead you to enlightenment, ofc. — itsbruce, Commented Mar 15, 2015 at 23:28
@itsbruce As it was an edge-case, I'd say this question is still on-topic. Feel free to review it. AFAIK, the fix is a very simple one to apply. > 0 to >= 0, right? — Simon Forsberg, Commented Mar 15, 2015 at 23:40

mjolka · Accepted Answer · 2015-03-16 00:10:07Z

Edge cases:

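";abc".LazySplit(";") will return a sequence containing only the empty string, because the loop exits as soon as a separator is found at index 0, so "abc" is lost. To match the behaviour of ";abc".Split(new char[] { ';' }) it should return the sequence { "", "abc" }.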
";abc".LazySplit("") will return a sequence with a single item, the empty string. To match the behaviour of ";abc".Split(new char[] { }) it should return the sequence { ";abc" }.

Here's how I would suggest writing it.

First, deal with the empty separator:

    if (separator.Length == 0)
    {
        yield return value;
        yield break;
    }

Then have two variables, start and end, that refer to the start and end of the substring we want to extract.

    var start = 0;
    for (var end = value.IndexOf(separator); end != -1; end = value.IndexOf(separator, start))
    {
        yield return value.Substring(start, end - start);
        start = end + separator.Length;
    }
    yield return value.Substring(start);
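Put together, the two fragments above might look like the sketch below. Two things here are assumptions rather than part of the answer: the private helper LazySplitIterator (introduced so that argument checks run at the call site instead of on the first MoveNext), and the use of StringComparison.Ordinal, chosen because string.Split matches separators ordinally whereas IndexOf(string) is culture-sensitive by default.

```csharp
using System;
using System.Collections.Generic;

public static class StringExtensions
{
    // Public wrapper: validates eagerly, so bad arguments throw immediately
    // rather than when the caller starts enumerating.
    public static IEnumerable<string> LazySplit(this string value, string separator)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (separator == null) throw new ArgumentNullException("separator");
        return LazySplitIterator(value, separator);
    }

    // Hypothetical helper holding the yield-based loop from the answer.
    private static IEnumerable<string> LazySplitIterator(string value, string separator)
    {
        // An empty separator never matches, so the whole string is the only item.
        if (separator.Length == 0)
        {
            yield return value;
            yield break;
        }

        var start = 0;
        // Assumption: ordinal comparison, to line up with string.Split.
        for (var end = value.IndexOf(separator, StringComparison.Ordinal);
             end != -1;
             end = value.IndexOf(separator, start, StringComparison.Ordinal))
        {
            yield return value.Substring(start, end - start);
            start = end + separator.Length;
        }

        // Whatever follows the last separator (possibly the empty string).
        yield return value.Substring(start);
    }
}
```

With this version, ";abc".LazySplit(";") yields "" followed by "abc", and "".LazySplit(";") yields a single empty string.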

To make your unit tests match the behaviour of string.Split, you also want to change LazyStringSplit to have

    var expected = new[] { "ab", "cd", "", "" };

and LazyStringSplitEmptyString to have

    var expected = new string[] { "" };

If you want to test that your implementation matches the behaviour of string.Split, I would suggest introducing a helper method for the tests. Something like

    var expected = value.Split(new string[] { separator }, StringSplitOptions.None);
    CollectionAssert.AreEqual(expected, value.LazySplit(separator).ToArray());
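One way such a helper could be shaped, independent of any test framework, is sketched here. The class and method names are hypothetical, and the splitter is passed in as a delegate so the same check can be pointed at any implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitComparison
{
    // Hypothetical helper: true when `splitter` yields exactly the items
    // that string.Split produces for the same input and separator.
    public static bool MatchesStringSplit(
        Func<string, string, IEnumerable<string>> splitter,
        string value,
        string separator)
    {
        var expected = value.Split(new[] { separator }, StringSplitOptions.None);
        return expected.SequenceEqual(splitter(value, separator));
    }
}
```

In a test this could be called as Assert.IsTrue(SplitComparison.MatchesStringSplit((v, s) => v.LazySplit(s), ";abc", ";")), once per edge case.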

I'll keep in mind to check more aggressively for edge-cases. Thanks for the insight. — Mephy, Commented Mar 16, 2015 at 1:39

svick · Accepted Answer · 2015-04-11 19:54:47Z

"you'll need to loop over the whole array, basically iterating the string twice"

Iterating twice doesn't have to be slower than iterating once, if iterating once is more complicated. When it comes to time complexity, both options are O(n). When it comes to actual performance, you need to measure. (And that's assuming that performance of this code actually matters.)

Specifically, arrays are very efficient in .NET, whereas iterating IEnumerable requires two virtual calls (MoveNext and Current) for every item.
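A rough way to measure this is sketched below with Stopwatch. All names are invented for illustration, and timings from a loop like this vary with runtime, hardware and JIT behaviour, so a dedicated benchmarking harness would give more trustworthy numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class IterationTiming
{
    // Sums string lengths by indexing the array directly.
    public static long SumViaArray(string[] items)
    {
        long total = 0;
        for (var i = 0; i < items.Length; i++) total += items[i].Length;
        return total;
    }

    // Same sum through the IEnumerable<string> interface
    // (one MoveNext and one Current call per item).
    public static long SumViaEnumerable(IEnumerable<string> items)
    {
        long total = 0;
        foreach (var item in items) total += item.Length;
        return total;
    }

    // Runs the action `repetitions` times and returns elapsed milliseconds.
    public static long TimeMs(Action action, int repetitions)
    {
        action(); // warm-up call so JIT compilation is not part of the timing
        var watch = Stopwatch.StartNew();
        for (var i = 0; i < repetitions; i++) action();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

For example, with var parts = text.Split(new[] { ";" }, StringSplitOptions.None), compare IterationTiming.TimeMs(() => IterationTiming.SumViaArray(parts), 1000) against IterationTiming.TimeMs(() => IterationTiming.SumViaEnumerable(parts), 1000).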
