3
\$\begingroup\$

The following is my second attempt at a basic CSV parser in Ruby. I've addressed suggestions that I've got for my previous version here.

I have omitted all test cases - the code listing would be too long otherwise. You can look at the full code-base at GitHub if you want to.

As opposed to the last version, the current solution supports:

  • multiple different value delimiters,
  • values surrounded by quotes,
  • delimiters as a part of quoted values,
  • quotes as a part of quoted values.

Please note that my intentions are not to implement a full-featured CSV parser, but to improve my Ruby programming skills. I'll appreciate any suggestions.


csv_parser.rb

require_relative 'line_parser' # An exception to report that the given delimiter is not supported. # class UnsupportedDelimiterException < Exception attr_reader :delimiter def initialize delimiter @delimiter = delimiter end def to_s "Unsupported delimiter [#{@delimiter}]." end end # CSV file parser. # # USAGE: # CsvFile.new(';').parse('data.csv').each { |row| puts row.firstname } # class CsvFile SUPPORTED_DELIMITERS = [ ",", "|", ";", " ", " "] DEFAULT_DELIMITER = "," # Initializes the parser. # # * *Args* : # - +delimiter+ -> The character to be used as delimiter. Defaults to # DEFAULT_DELIMITER. Must be one of SUPPORTED_DELIMITERS, otherwise # an UnsupportedDelimiterException is raised. # def initialize delimiter = DEFAULT_DELIMITER if not SUPPORTED_DELIMITERS.include? delimiter raise UnsupportedDelimiterException.new delimiter end @line_parser = LineParser.new(Lexer.new delimiter) end # Parses the given CSV file and returns the result as an array of rows. # def parse file rows = [] headers = @line_parser.parse file.gets.chomp file.each do |line| values = {} headers.zip(@line_parser.parse line.chomp).each do |key, value| values[key] = value end rows << CsvRow.new(values) end rows end end # CSV row. # class CsvRow # Creates a new CSV row with the given values. # # * *Args* : # - +values+ -> a hash containing the column -> value mapping # def initialize values @values = values end # Returns the value in the column given as method name, or null if # no such value exists. # def method_missing name, *args @values[name.to_s] end end 

line_parser.rb

require_relative 'lexer' class ParseError < RuntimeError end # CSV line parser. # class LineParser # Initializes the parser with the given lexer instance. # def initialize lexer @lexer = lexer end # Parses the given CSV line into a collection of values. # def parse line values = [] last_seen_identifier = false tokens = @lexer.tokenize line tokens.each do |token| case token when EOFToken if not last_seen_identifier values << "" end break when DelimiterToken if not last_seen_identifier values << "" next else last_seen_identifier = false end when IdentifierToken if last_seen_identifier raise ParseError, "Unexpected identifier - a delimiter was expected." end last_seen_identifier = true values << token.lexem end end values end end 

lexer.rb

require_relative 'assertions' class LexicalError < RuntimeError end class Token end class EOFToken < Token def to_s "EOF" end end class DelimiterToken < Token attr_reader :lexem def initialize lexem @lexem = lexem end def to_s "DELIMITER(#{@lexem})" end end class IdentifierToken < Token attr_reader :lexem def initialize lexem @lexem = lexem end def to_s "IDENTIFIER(#{@lexem})" end end # CSV line lexical analyzer. # class Lexer # Initialzes the lexer. # # * *Args* : # - +delimiter+ -> The character to be used as delimiter. # def initialize delimiter @delimiter = delimiter end # Breaks the given CSV line into a sequence of tokens. # def tokenize stream stream = stream.chars.to_a tokens = [] while true tokens << next_token(stream) if tokens.last.is_a? EOFToken break end end tokens end private def next_token stream char = stream.shift case char when eof EOFToken.new when delimiter DelimiterToken.new delimiter when quotes stream.unshift char get_quoted_identifier stream else stream.unshift char get_unquoted_identifier stream end end def get_unquoted_identifier stream lexem = "" while true do char = stream.shift case char when delimiter stream.unshift char return IdentifierToken.new lexem when eof return IdentifierToken.new lexem else lexem << char end end end def get_quoted_identifier stream char = stream.shift assert { char == quotes } lexem = "" while true do char = stream.shift case char when eof raise LexicalError, "Unexpected EOF within a quoted string." when quotes if stream.first == quotes lexem << quotes stream.shift else return IdentifierToken.new lexem end else lexem << char end end end def eof nil end def delimiter @delimiter end def quotes '"' end end 

assertions.rb

# An exception to represent assertion violations. # class AssertionError < RuntimeError end # Evaluates the given block as a boolean expression and throws an AssertionError # in case it results to false. # def assert &block raise AssertionError unless yield end 
\$\endgroup\$

    2 Answers 2

    4
    \$\begingroup\$

    if not

    It is very 'unruby' to use if not. Ruby has a special unless just for these cases:

     unless last_seen_identifier values << "" end break 

    You can even shorten this further by using the one-liner shorthand:

    values << "" unless last_seen_identifier break 

    One more general guideline - although you can have an unless .. else .. end construct, it is not recommended. Better to reverse the conditional:

    # not good if not last_seen_identifier values << "" next else last_seen_identifier = false end # not better unless last_seen_identifier values << "" next else last_seen_identifier = false end # better if last_seen_identifier last_seen_identifier = false else values << "" next end # best values << "" unless last_seen_identifier last_seen_identifier = false 

    Too Many Classes

    Contrary to languages like Java or C#, Ruby is not statically typed, but rather duck typed, which means that you can pass any type anywhere, and the objects don't have to inherit from the same base class.

    For example, the class Token adds no functionality, and it is not used anywhere in the code besides being inherited by a bunch of classes. It is totally redundant, and deleting it will not affect the code.

    Furthermore, EOFToken and DelimiterToken don't contain any intrinsic functionality, and according to ruby's idioms, a better solution would be to simply pass symbols (:eof and :delimiter) rather than classes.

    The last token (IdentifierToken) holds the current value, but you can even skip that, simply passing the value itself instead.

    This will make the token iteration look like this:

    tokens.each do |token| case token when :eof values << "" unless last_seen_identifier break when :delimiter values << "" unless last_seen_identifier last_seen_identifier = false else if last_seen_identifier raise ParseError, "Unexpected identifier - a delimiter was expected." end last_seen_identifier = true values << token end end 

    Assertions?

    Assertions are used by some languages to make some development checks where a condition is assumed to be always true. This methodology is not longer en-vogue in most places, and even where it is - it is also disabled in production:

    Most languages allow assertions to be enabled or disabled globally, and sometimes independently. Assertions are often enabled during development and disabled during final testing and on release to the customer. Not checking assertions avoids the cost of evaluating the assertions while (assuming the assertions are free of side effects) still producing the same result under normal conditions. Under abnormal conditions, disabling assertion checking can mean that a program that would have aborted will continue to run. This is sometimes preferable.

    In your code the need for the assertion seems a bit artificial, since it checks that the first character is exactly the character pushed back by the caller:

    ... when quotes stream.unshift char get_quoted_identifier stream ... char = stream.shift assert { char == quotes } 

    that being the case, I would argue that you can remove the unshift/shift tango, and simply assume that the first char is quotes:

    ... when quotes get_quoted_identifier stream ... char = quotes 
    \$\endgroup\$
      1
      \$\begingroup\$

      Your UnsupportedDelimiterException class is very clean. I particularly like that you over rode the to_s method so that it returns something meaningful. It's a little detail that often gets over looked. Well done.

      That said, I'm not sure what you're really getting from the ParseError class.

      class ParseError < RuntimeError end 

      Yes, it's nice to get a parse error rather than a runtime error, but that's all you're gaining. It seems a little silly to inherit from a class and then have the exact same functionality.

      \$\endgroup\$
      3
      • 1
        \$\begingroup\$I disagree. It is a very good idea to have separate error classes. It allows you to rescue parse errors only. If you didn't do this, you'd have to rescue all StandardErrors, check the error message and re-raise the exception if you cannot handle it. It's not about functionality in this case, but about providing context and semantics.\$\endgroup\$CommentedDec 16, 2014 at 15:03
      • \$\begingroup\$That's a very valid point @britishtea. Feel free to downvote my answer if you think it's bad advice.\$\endgroup\$CommentedDec 16, 2014 at 15:05
      • \$\begingroup\$This is indeed bad advice. If someone would be using this code and receive only RuntimeError it would be harder to find what the problem is. I understand your point and would agree if would have been something like class SpecificParseError < ParseError that would maybe not have helped, but giving context to the caller is important.\$\endgroup\$CommentedAug 12, 2015 at 12:40

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.