1
\$\begingroup\$

I (ruby beginner) have a performance issue with the following ruby code. The idea is to aggregate a CSV into a smaller, better-to-handle file and calculating the average of some values. Basically the same code in groovy runs about 30 times faster. And I can't believe that Ruby is that slow! ;-)

Some background: The file consists of lines like this (it is an output file of a JMeter performance test):

timeStamp|elapsed|label|responseCode|responseMessage|threadName|dataType|success|failureMessage|bytes|Latency 2013-05-17_16:30:11.261|4779|Session_Cookie|200|OK|Thread-Gruppe 1-1|text|true||21647|4739 

All values of e.g. a certain minute are selected by looking at the first 16 characters of the timestamp:

2013-05-17_16:30:11.261 => 2013-05-17_16:30 

I wanted to collect values in buckets/slices which are represented by the shortened timestamp and the label (third column). The value of the second column ("elapsed") is summed up.

require 'csv' require_relative 'slice' # Parameters: <precision> <input file> [<output file>] precision = ARGV[0].to_i input_file = ARGV[1] output_file = ARGV[2] time_slices = Hash.new CSV.foreach(input_file, {:col_sep => '|', :headers => :first_line}) do |row| current_time_slice = row['timeStamp'][0, precision] if time_slices[current_time_slice] == nil time_slices[current_time_slice] = Hash.new end if time_slices[current_time_slice][row['label']] time_slices[current_time_slice][row['label']].put_line(row) else new_slice = Slice.new(current_time_slice, row['label']) new_slice.put_line(row) time_slices[current_time_slice][row['label']] = new_slice end end out = File.new(output_file, 'a') out.puts 'time|label|elapsed_average' time_slices.values.each do |time_slice| time_slice.values.each do |slice| out.puts slice.aggregated_row end end 

The slice class looks like this:

class Slice attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count def initialize(slice_timestamp, slice_label) @slice_timestamp = slice_timestamp @slice_label = slice_label @count = 0 @sum = 0 end def put_line(line) @sum = @sum + line[1].to_i @count = @count + 1 end def average @sum / @count end def aggregated_row @slice_timestamp + '|' + @slice_label + '|' + average.to_s end end 

I think that I chose a quite straightforward and non-optimized approach, but still - the same approach is much faster in Groovy. What can be the reason for that?

\$\endgroup\$
4
  • \$\begingroup\$You don't post the groovy code nor the input file (or did I miss the links?) so we cannot compare. Anyway, I don't see anything out of the usual in your code (regarding efficiency, it could be indeed rewritten in a more idiomatic way). Ruby is slow, a very slow language, although 30x vs another dynamic language seems too much.\$\endgroup\$
    – tokland
    CommentedJul 15, 2013 at 20:11
  • \$\begingroup\$I put the Groovy code in this gist: gist.github.com/marvin9000/6006367 (it evolved a little bit in the meantime, but I think the basic idea is still the same). If you want to run it on some testdata, you can download it here: filehosting.org/file/details/435358/input.csv.tgz\$\endgroup\$
    – Elbonian
    CommentedJul 16, 2013 at 7:37
  • \$\begingroup\$One more addition: I ran both scripts on the sample data again and the difference was "only" about 3x (6 sec vs. 23 sec). It was more significant on larger files (400+ MB), which I don't want to upload here.\$\endgroup\$
    – Elbonian
    CommentedJul 16, 2013 at 7:43
  • \$\begingroup\$For the record - ruby version: ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2], Groovy Version: 2.1.3 JVM: 1.6.0_51 Vendor: Apple Inc. OS: Mac OS X\$\endgroup\$
    – Elbonian
    CommentedJul 16, 2013 at 7:47

1 Answer 1

2
\$\begingroup\$
if time_slices[current_time_slice] == nil time_slices[current_time_slice] = Hash.new end 

It's more idomatic to write

time_slices[current_time_slice] ||= {} 

In general don't use Hash.new if you can just write {}. You can rid of that statement alltogether though if you declare the time_slices variable with

time_slices = Hash.new { |h, k| h[k] = {} } 

instead of time_slices = Hash.new.

if time_slices[current_time_slice][row['label']] time_slices[current_time_slice][row['label']].put_line(row) else new_slice = Slice.new(current_time_slice, row['label']) new_slice.put_line(row) time_slices[current_time_slice][row['label']] = new_slice end 

Also

time_slices[current_time_slice][row['label']] ||= Slice.new ... 

is more idiomatic here. But there is more to say here:

  1. Don't forget that Ruby is an interpreted language in principle (although the byte compiler of Ruby 1.9 relativizes that), so don't repeat complex references like time_slices[current_time_slice][row['label']]. But that's not only an performance but even more so an readability issue.
  2. Why does a Slice instance has to know the current_time_slice and label? That's completely redundant, since that information is already in the Hash. Get rid of that.

Considering this I would write the loop like this:

time_slices = Hash.new { |h, k| h[k] = Hash.new { |h, k| h[k] = Slice.new } } CSV.foreach(input_file, col_sep: '|', headers: :first_line) do |row| current_time_slice = row['timeStamp'][0, precision] time_slices[current_time_slice][row['label']].put_line(row) end 

Of course you then have to modify your output loop a little

time_slices.each do |slice_timestamp, time_slice| time_slice.each do |label, slice| out.puts [slice_timestamp, label, slice.average].join('|') end end 

So you don't need the Slice#aggregated_row method anymore. But that's a good idea anyway: e.g. why does Slice has to know about your pipe separator? That's not a good separation of concerns.

Note that I avoid the + operator here. That would create a new String instance with every + call. Alternatively you can also try <<, but be aware that this modifies the leftmost string (but oftentimes that doesn't matter).

In contrast there is no penalty in using + in groovy normally, because modern JVMs are using StringBuilder under the hood for it when they see fit. But in Ruby you still should be aware of its implications.

attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count 

Why are you declaring these accessors? They are never used.

\$\endgroup\$

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.