Very simple CSV-parser in Java

Question

Please take a look at my method for parsing a CSV string.

I am looking for a simple approach without using an external library.

Is throwing a RuntimeException inside the stream appropriate?

import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class MyCsvParser { public static String[][] parse(final String str, final int cols) { List<String[]> list = new ArrayList<>(); str.lines().forEach(l -> { String[] split = l.split(","); if (split.length != cols) { throw new RuntimeException(); } list.add(split); }); return list.toArray(new String[0][]); } public static void main(String[] args) { System.out.println(Arrays.deepToString(parse("bla,blaa\nblub,blubb", 2))); } }

What is a "CSV String"? RFC 4180? In which case this code is obviously simplistic and can't parse a CSV. — Boris the Spider, CommentedSep 15, 2024 at 18:50
@BoristheSpider Not all edge cases of RFC 4180 must be observed, instead the processing should simply be aborted. This is the meaning of "very simple". So the method parses a string correctly according to the requirements. — Tobias Grothe, CommentedSep 15, 2024 at 19:21
Change your test string to "bla, \"bl,ah\"" and see why rolling your own CSV parser is almost always a mistake. Also, not all CSV files use a comma as the separator, many non-english countries use a semi-colon. — Neil, CommentedSep 16, 2024 at 8:44
You're not parsing CSV, you're parsing a very small subset of CSV. You should define exactly what you support (which is not CSV). Otherwise it's impossible to state whether your implementation is correct (it is definitely not correct if you should be able to parse any random CSV file). Having quotes or commas in a cell inside a CSV file is not "an edge case", it happens very often. Having line-feeds/carriage returns is probably a little less common but it does also happen. — jcaron, CommentedSep 16, 2024 at 14:04
"Having quotes or commas in a cell inside a CSV file is not "an edge case"". As I see it, this is an edge case. I don't have to provide a solution for every "valid" CSV format. Parsing can simply be stopped in such cases. — Tobias Grothe, CommentedSep 16, 2024 at 15:09

Reinderien · Accepted Answer · 2024-09-16 12:55:20Z

Side-effects

Stream.forEach() operation should be utilized with care since it operates via side-effects and should not be used as a substitution of a proper reduction operation.

The way you've written this stream is discouraged by the Stream API documentation because it makes code more cluttered and difficult to follow and more importantly your solution is broken with parallel streams (there should be no assumptions on the nature of the stream in your code).

In this particular case, you should be using Stream.toArray() operation instead.

Multiline lambdas

Try to void them. They bear a lot of cognitive load. If a lambda expression requires several lines or when you have a complex single-line lambda (e.g. with a nested stream in it), consider introducing a method.

Exceptions

In short, the purpose of exceptions is to indicate cases when it's not possible to proceed with the normal execution flow.

If you stumbled on a corrupt piece of data which violates invariant that are important for your business logic, usually you don't want to proceed processing it. That's a valid case to throw. You asked can you "throw from a stream"? Sure, it's just a means iteration.

^{I've seen some bickering over whether it's appropriate to use exceptions for the purpose of validation. Sure it is, we do employ Exceptions for Validation for decades.}

^{Unless you're using exceptions to avoid conditional logic, or to make weird hacks like throwing in order to break from a recursive method, and you have a genuine invalid piece of data on your hands you can and should throw.}

Another, important note: exceptions should be informative. If standard exception types can describe the case at hand, fine, if not introduce your own exception type.

Also, use proper exception messages that will be helpful in investigating the issue.

Static routines

Don't treat everything as util classes, use the Power of object-orientation to make more clean, cohesive and testable.

Refactored version

public class ArrayParser { private final String separator; private final int columnCount; public ArrayParser(String separator, int columnCount) { this.separator = separator; this.columnCount = columnCount; } public String[][] parse(final String str) { return str.lines() .map(this::parseLine) .toArray(String[][]::new); } private String[] parseLine(String toParse) { String[] line = toParse.split(separator); validateLine(line); return line; } private void validateLine(String[] line) { if (line.length != columnCount) { throw new LineParsingException(line, columnCount); } } }

Exception example:

private class LineParsingException extends RuntimeException { private static final String MESSAGE_TEMPLATE = """ The actual number of columns in the line %s doesn't match the expected number of columns %d"""; public LineParsingException(String[] line, int columnsExpected) { super(MESSAGE_TEMPLATE.formatted(Arrays.toString(line), columnsExpected)); } }

@tgrothe ParseException has a couple of downsides: it's a checked exception and it's very general. It's common practice to avoid checked exceptions while describing cases of abnormal behavior in your domain space because they lead to leaky abstractions and pollute your code. Some people here might disagree, but it's a general consensus is that introduction of checked exceptions was an unsuccessful experiment in language modeling (that's the opinion expressed by Java Champions and established Java community members). — Alexander Ivanchenko, CommentedSep 15, 2024 at 11:25
@tgrothe columnLength this name is misleading, it might suggest a column is expected to contain a certain number of characters, or something along these lines. Something like columnCount will do a better job at communicating that the purpose of the field is to describe the expected number columns. — Alexander Ivanchenko, CommentedSep 15, 2024 at 11:32
ArrayParser => Ach Du Lieber! What such a anti-PoOO class name. The eponymous method is tolerable only because it a constructor. Double-plus good for having one. From the last few years of CodeReview reviewing I thought constructors were banned. PoOO is merely a way to quality code. The WAY is shut without constructors. P.S. plus one. — radarbob, CommentedSep 15, 2024 at 20:37
@radarbob And if an instance of a class requires external data for initialization, it should be provided explicitly through a constructor (not via setters). Because it's a responsibility of constructors to produce a fully initialized class instance. — Alexander Ivanchenko, CommentedSep 18, 2024 at 2:14
provided explicitly through a constructor. Yes. I always imagine some coder who is not me using that class. If you're not constructor-ing in these simple cases what will happen when it's a hierarchal composition of a dozen related classes. Makes all the effort of evolving beyond assembler seem like a waste. Well, I still have my IBM 360 assembler coding card just in case. — radarbob, CommentedSep 18, 2024 at 7:51

J_H · Accepted Answer · 2024-09-17 22:21:43Z

conservative design

Since this is billed as "a CSV parser", a caller may reasonably believe they could send it any *.csv file produced by Excel. Better to advertise it as MyRestrictedCsvParser. The /** javadoc */ comments should explain the restrictions.

Each field may or may not be enclosed in double quotes

This library should probably throw a fatal error upon encountering an ASCII 34 " double quote anywhere in an input line. Then a caller would not accidentally consume a data file in the belief that it had been parsed one way when in fact the library parsed it another way. That is, part of scoping down requirements is reducing the space of inputs you're willing to claim you successfully processed.

informative diagnostic

Throwing an unchecked exception within the JVM is great. It makes your library easier for callers to consume.

 throw new RuntimeException();

This is not a very diagnostic error. It needs two improvements:

Subclass RuntimeException to create a library-specific error, perhaps CsvParseException.
Mention the values of split.length and cols in the message, to save a maintenance engineer a little effort in diagnosing and repairing buggy inputs.

Consider keeping track of which line number we're on, so that can be included in the diagnostic message.

A caller should not be forced to catch a generic RuntimeException to recover from an error it knows how to deal with. We define new app-specific exception types to permit fine-grained catching. Lumping "wrong column count", "found a quote", and "zero lines" together would be acceptable, at least until you see how callers actually behave. If it turns out that callers really do wish to distinguish between those errors, then a v2 library release could always offer finer granularity on the error types.

signature

Clearly the OP code works. It seems slightly less convenient for the caller than it might be. There is redundant information encoded in the str and cols parameters.

Consider setting cols based on number of fields found in the first line of input.

It is true, I assumed strings without double quotes at all. This deliberately contradicts point 5 of the standard or specification, and should be mentioned in a Java doc. — Tobias Grothe, CommentedSep 16, 2024 at 8:48

DuncG · Accepted Answer · 2024-09-21 07:42:03Z

It is fine to raise runtime exception, but it is more helpful to add a specific exception message. I would suggest small simplification removing ArrayList, replacing the forEach with map, and stream directly to array:

public static String[][] parse(final String str, final int cols) { return str.lines().map(line -> line.split(",")) .map(sa -> { if (sa.length != cols) { throw new IllegalArgumentException("Mismatched column size: "+Arrays.deepToString(sa)); } return sa; }).toArray(String[][]::new); }

Adapting the above to return Stream<String[]> as suggested by user286335 will make memory use easier if you read huge input files, as you can process while reading rather than afterwards.

You could also parse without pre-defining the expected column count by addition of a groupingBy collector, and raise exception when detecting inconsistent number of columns across all rows:

public static String[][] parse(final String str) { Map<Integer, List<String[]>> map = str.lines().map(line -> line.split(",")) .collect(Collectors.groupingBy(sa -> Integer.valueOf(sa.length))); // Could handle empty data here, or treat as error as below // if (map.size() == 0) return new String[0][]; if (map.size() != 1) throw new IllegalArgumentException("No data or mismatched row sizes: "+map.keySet()); // Return the first (and only) entry, all rows have same number of columns: return map.values().iterator().next().toArray(new String[0][]); }

@AlexanderIvanchenko Good idea, I overlooked the default groupingBy behaviour, amended — DuncG, CommentedSep 21, 2024 at 7:43

user286335user286335 · Accepted Answer · 2024-09-15 14:37:18Z

Without throwing an exception for a different number of columns than the give cols bound the code could get less complex to...

public static String[][] parseIt(final String str, final int cols) { return str.lines() .map(line -> line.split(",")) .filter(columns -> columns.length == cols) .toArray(String[][]::new); }

...if ignoring the incomplete or exceeded number of columns is an option, otherwise adjust the filtering to throw an exception in case of unfitted number of columns.

Returning an array of arrays of String bounds the rest of the code using the method to convert the String[][] into an Iterator or a Stream to go over its content, returning a Stream could improve the overall complexity and reduce the code lines of parse method...

public static Stream<String[]> parse(final String str, final int cols) { return str.lines() .map(line -> line.split(",")) .filter(columns -> columns.length == cols); }

"Without throwing an exception for a different number of columns than the give cols bound the code could get less complex" - you conflate altering the requirements and reducing code complexity. — Alexander Ivanchenko, CommentedSep 15, 2024 at 14:43
Not giving any indication that the given data is corrupt and silently processing the only records with the valid number of columns is a different problem from the one that OP's code solves. — Alexander Ivanchenko, CommentedSep 15, 2024 at 14:48
But ... returning a Stream<String[]> instead of the entire string array is a valid point. Even if, strictly spoken, this has changed the requirements or problem (also) a bit. — Tobias Grothe, CommentedSep 15, 2024 at 16:13

Stack Exchange Network

Very simple CSV-parser in Java

4 Answers 4

Side-effects

Multiline lambdas

Exceptions

Static routines

Refactored version

conservative design

informative diagnostic

signature

Hot Network Questions

Very simple CSV-parser in Java

4 Answers 4

Side-effects

Multiline lambdas

Exceptions

Static routines

Refactored version

conservative design

informative diagnostic

signature

Related

Hot Network Questions