This is my second attempt at a basic CSV parser in Ruby. I've addressed the suggestions I received on my previous version here.
I have omitted all test cases, since the code listing would be too long otherwise. The full code base is available on GitHub.
Unlike the previous version, the current solution supports:
- multiple different value delimiters,
- values surrounded by quotes,
- delimiters as a part of quoted values,
- quotes as a part of quoted values.
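To make the four cases above concrete, here is a minimal standalone sketch. It is not the code from this post: `split_csv_line` is a hypothetical helper that only assumes the usual CSV convention that a doubled quote inside a quoted value stands for one literal quote.

```ruby
# Hypothetical helper for illustration only; not part of the parser below.
# Splits a single CSV line, honouring quoted values and doubled quotes.
def split_csv_line(line, delimiter = ",")
  values = [+""]
  in_quotes = false
  chars = line.chars
  i = 0
  while i < chars.length
    char = chars[i]
    if in_quotes
      if char == '"' && chars[i + 1] == '"'
        values.last << '"'    # doubled quote inside quotes: literal quote
        i += 1                # skip the second quote of the pair
      elsif char == '"'
        in_quotes = false     # closing quote
      else
        values.last << char   # a delimiter is an ordinary character here
      end
    elsif char == '"'
      in_quotes = true        # opening quote
    elsif char == delimiter
      values << +""           # start the next value
    else
      values.last << char
    end
    i += 1
  end
  values
end

p split_csv_line('a;"b;c";"say ""hi"""', ";")
# prints ["a", "b;c", "say \"hi\""]
```

Unlike the lexer in this post, the sketch does not report an unterminated quoted value; it simply returns what it has collected so far.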
Please note that my intention is not to implement a full-featured CSV parser but to improve my Ruby programming skills. I'd appreciate any suggestions.
csv_parser.rb
    require_relative 'line_parser'

    # An exception to report that the given delimiter is not supported.
    #
    class UnsupportedDelimiterException < Exception
      attr_reader :delimiter

      def initialize delimiter
        @delimiter = delimiter
      end

      def to_s
        "Unsupported delimiter [#{@delimiter}]."
      end
    end

    # CSV file parser.
    #
    # USAGE:
    #   File.open('data.csv') do |file|
    #     CsvFile.new(';').parse(file).each { |row| puts row.firstname }
    #   end
    #
    class CsvFile
      SUPPORTED_DELIMITERS = [ ",", "|", ";", " ", "\t" ]
      DEFAULT_DELIMITER = ","

      # Initializes the parser.
      #
      # * *Args*    :
      #   - +delimiter+ -> The character to be used as delimiter. Defaults to
      #     DEFAULT_DELIMITER. Must be one of SUPPORTED_DELIMITERS, otherwise
      #     an UnsupportedDelimiterException is raised.
      #
      def initialize delimiter = DEFAULT_DELIMITER
        if not SUPPORTED_DELIMITERS.include? delimiter
          raise UnsupportedDelimiterException.new delimiter
        end
        @line_parser = LineParser.new(Lexer.new delimiter)
      end

      # Parses the given CSV file and returns the result as an array of rows.
      # The first line is treated as the header row. +file+ must be an open
      # IO-like object (it is read with +gets+ and +each+), not a file name.
      #
      def parse file
        rows = []
        headers = @line_parser.parse file.gets.chomp
        file.each do |line|
          values = {}
          headers.zip(@line_parser.parse line.chomp).each do |key, value|
            values[key] = value
          end
          rows << CsvRow.new(values)
        end
        rows
      end
    end

    # CSV row.
    #
    class CsvRow
      # Creates a new CSV row with the given values.
      #
      # * *Args*    :
      #   - +values+ -> a hash containing the column -> value mapping
      #
      def initialize values
        @values = values
      end

      # Returns the value in the column given as method name, or nil if
      # no such value exists.
      #
      def method_missing name, *args
        @values[name.to_s]
      end
    end
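The row-building step in `CsvFile#parse` pairs each header with the value in the same position. The sketch below shows that step in isolation, under two assumptions: `naive_parse` is a hypothetical stand-in for `LineParser#parse` that does not understand quotes, and `rows_from` is a hypothetical name for the loop. A `StringIO` plays the role of the open file, since the method reads its argument with `gets` and then iterates over the remaining lines.

```ruby
require 'stringio'

# Hypothetical stand-in for LineParser#parse: a plain split that keeps
# trailing empty fields. It does not handle quoted values.
def naive_parse(line, delimiter)
  line.split(delimiter, -1)
end

# Reads the header line first, then maps every remaining line to a
# header => value hash, mirroring the zip-based loop in CsvFile#parse.
def rows_from(io, delimiter = ";")
  headers = naive_parse(io.gets.chomp, delimiter)
  io.each_line.map do |line|
    headers.zip(naive_parse(line.chomp, delimiter)).to_h
  end
end

data = StringIO.new("firstname;lastname\nJohn;Doe\nJane;Roe\n")
rows = rows_from(data)
puts rows.first["firstname"]   # prints John
```

Here the rows are plain hashes; the code in this post wraps each hash in a `CsvRow` so that columns can be read as methods.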
line_parser.rb
    require_relative 'lexer'

    class ParseError < RuntimeError
    end

    # CSV line parser.
    #
    class LineParser
      # Initializes the parser with the given lexer instance.
      #
      def initialize lexer
        @lexer = lexer
      end

      # Parses the given CSV line into a collection of values.
      #
      def parse line
        values = []
        last_seen_identifier = false
        tokens = @lexer.tokenize line
        tokens.each do |token|
          case token
          when EOFToken
            if not last_seen_identifier
              values << ""
            end
            break
          when DelimiterToken
            if not last_seen_identifier
              values << ""
              next
            else
              last_seen_identifier = false
            end
          when IdentifierToken
            if last_seen_identifier
              raise ParseError, "Unexpected identifier - a delimiter was expected."
            end
            last_seen_identifier = true
            values << token.lexem
          end
        end
        values
      end
    end
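The interesting part of `LineParser#parse` is how the `last_seen_identifier` flag turns two delimiters in a row, or a trailing delimiter, into an empty value. The following sketch traces that logic on hand-written token sequences. It assumes tokens modelled as plain arrays such as `[:identifier, "a"]` rather than the `Token` subclasses used in this post, and `values_from` is a hypothetical name.

```ruby
# Illustration only: tokens are arrays like [:identifier, "a"], [:delimiter]
# and [:eof] instead of the Token subclasses from lexer.rb.
def values_from(tokens)
  values = []
  last_seen_identifier = false
  tokens.each do |type, lexem|
    case type
    when :identifier
      raise ArgumentError, "a delimiter was expected" if last_seen_identifier
      values << lexem
      last_seen_identifier = true
    when :delimiter, :eof
      # No identifier since the previous delimiter means an empty field.
      values << "" unless last_seen_identifier
      last_seen_identifier = false
      break if type == :eof
    end
  end
  values
end

# Token sequence for the line "a,,c"
p values_from([[:identifier, "a"], [:delimiter], [:delimiter],
               [:identifier, "c"], [:eof]])
# prints ["a", "", "c"]
```

A line that ends in a delimiter, such as "a,", yields `["a", ""]` by the same rule, because the EOF token arrives with no identifier seen after the last delimiter.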
lexer.rb
    require_relative 'assertions'

    class LexicalError < RuntimeError
    end

    class Token
    end

    class EOFToken < Token
      def to_s
        "EOF"
      end
    end

    class DelimiterToken < Token
      attr_reader :lexem

      def initialize lexem
        @lexem = lexem
      end

      def to_s
        "DELIMITER(#{@lexem})"
      end
    end

    class IdentifierToken < Token
      attr_reader :lexem

      def initialize lexem
        @lexem = lexem
      end

      def to_s
        "IDENTIFIER(#{@lexem})"
      end
    end

    # CSV line lexical analyzer.
    #
    class Lexer
      # Initializes the lexer.
      #
      # * *Args*    :
      #   - +delimiter+ -> The character to be used as delimiter.
      #
      def initialize delimiter
        @delimiter = delimiter
      end

      # Breaks the given CSV line into a sequence of tokens.
      #
      def tokenize stream
        stream = stream.chars.to_a
        tokens = []
        while true
          tokens << next_token(stream)
          if tokens.last.is_a? EOFToken
            break
          end
        end
        tokens
      end

      private

      def next_token stream
        char = stream.shift
        case char
        when eof
          EOFToken.new
        when delimiter
          DelimiterToken.new delimiter
        when quotes
          stream.unshift char
          get_quoted_identifier stream
        else
          stream.unshift char
          get_unquoted_identifier stream
        end
      end

      def get_unquoted_identifier stream
        lexem = ""
        while true do
          char = stream.shift
          case char
          when delimiter
            stream.unshift char
            return IdentifierToken.new lexem
          when eof
            return IdentifierToken.new lexem
          else
            lexem << char
          end
        end
      end

      def get_quoted_identifier stream
        char = stream.shift
        assert { char == quotes }
        lexem = ""
        while true do
          char = stream.shift
          case char
          when eof
            raise LexicalError, "Unexpected EOF within a quoted string."
          when quotes
            if stream.first == quotes
              lexem << quotes
              stream.shift
            else
              return IdentifierToken.new lexem
            end
          else
            lexem << char
          end
        end
      end

      def eof
        nil
      end

      def delimiter
        @delimiter
      end

      def quotes
        '"'
      end
    end
assertions.rb
    # An exception to represent assertion violations.
    #
    class AssertionError < RuntimeError
    end

    # Evaluates the given block as a boolean expression and raises an
    # AssertionError if it evaluates to false (or nil).
    #
    def assert &block
      raise AssertionError unless yield
    end
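For completeness, a short usage sketch of the block-based `assert`. The definitions are repeated so the snippet runs on its own, and `check_positive` is a hypothetical caller made up for the example.

```ruby
class AssertionError < RuntimeError
end

def assert
  raise AssertionError unless yield
end

# Hypothetical caller: guards an internal invariant with assert.
def check_positive(number)
  assert { number > 0 }
  number
end

check_positive(3)        # passes silently and returns 3

begin
  check_positive(-1)     # the block evaluates to false
rescue AssertionError => error
  puts "caught #{error.class}"   # prints caught AssertionError
end
```

Because the condition is passed as a block, it is only evaluated inside `assert`, which is where the error is raised.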