
I am looking for a command-line tool to calculate the sum of the values in a specified column of a CSV file. (Update: The CSV file might have quoted fields, so a simple approach that just splits on the delimiter (',') does not work.)

Given the following sample CSV file:

description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22

I want to compute the sum, for example over the column data 1, with the result 19.

I have tried to use csvkit for this but didn't get very far. Are there other command-line tools specialised in this kind of CSV operation?

  • This question is similar to: Processing tabular data from the command-line. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem.
    – muru
    Commented Jul 2, 2024 at 1:52
  • Not exactly the same, but for the suggestions in the dupe, you just need to drop the groupby stuff for datamash and the reorder parts for miller (so, e.g., datamash --header-in -st , sum 3, mlr --csv -N stats1 -a sum -f 3)
    – muru
    Commented Jul 2, 2024 at 1:56
  • @muru Thanks for the link to the other question. I missed saying that the CSV file might have quoted fields. I updated the question (incl. sample CSV) to make that clear.
    – halloleo
    Commented Jul 2, 2024 at 3:58
  • @muru Your 2nd comment is great. Would you mind making this into an answer?
    – halloleo
    Commented Jul 2, 2024 at 4:01

5 Answers


Miller handles quoted CSVs natively, so the following should work:

mlr --csv --headerless-csv-output stats1 -a sum -f 'data 1' 
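For example, with the sample from the question saved to a file and passed as an argument (the /tmp/sample.csv path is just for illustration; mlr must be installed), the whole run should print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# Sum the "data 1" column; the quoted "banana,apple" field is parsed correctly
mlr --csv --headerless-csv-output stats1 -a sum -f 'data 1' /tmp/sample.csv
```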

    I put your CSV into a file a.csv and did the summing as follows:

    cat a.csv | awk -F, '{ if($3 !~ /data*/ ){ total+= $3} } END { print "sum=" total}' 

    This uses awk with a comma as the field separator. If field #3 does not match /data*/ (which skips the header row), its value is added to total; at the end, "sum=" and the total are printed. This is a subset of the Processing Tabular Data awk answer.
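To see why the quoted fields from the updated question break this, you can run the same comma-split logic against the sample data: the embedded commas shift the fields, so on those rows $3 is a non-numeric fragment that awk treats as 0, and the sum comes out as 7 instead of 19. A self-contained sketch (NR > 1 is used here to skip the header; the path is illustrative):

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# Naive comma splitting: $3 is apple" / 7 / meat" on the data rows
awk -F, 'NR > 1 { total += $3 } END { print "sum=" total }' /tmp/sample.csv
# prints sum=7, not sum=19
```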

    An updated version that uses FPAT to handle the quoted fields:

    cat a.csv | awk -v FPAT='([^,]*)|("[^"]+")' '{ if($3 !~ /data*/ ){ total+= $3 } } END {print "sum="total}' 

    FPAT is a gawk extension (it won't work in old-style awk) that tells awk what a field looks like, rather than what the separator looks like. Here the pattern has two alternatives: ([^,]*) matches a run of zero or more non-comma characters, and ("[^"]+") matches one or more non-quote characters enclosed in double quotes.
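Putting the FPAT version together against the sample file as a self-contained sketch (gawk required; /tmp/sample.csv is an illustrative path, and NR > 1 is used instead of the /data*/ test to skip the header), this should print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# FPAT defines a field as either a run of non-commas or a quoted string,
# so "banana,apple" counts as a single field and $3 is always the number
gawk -v FPAT='([^,]*)|("[^"]+")' 'NR > 1 { total += $3 } END { print total }' /tmp/sample.csv
```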

    • Thanks for the answer. My question was not detailed enough: I missed saying that the CSV file might have quoted fields. Sorry for this, @user1683793. I updated the question (incl. sample CSV) to make that clear.
      – halloleo
      Commented Jul 2, 2024 at 4:00
    • @halloleo, I believe you will find this updated version satisfactory. I had not dealt with quoted strings in awk before. Interesting.
      Commented Jul 2, 2024 at 4:37
    • Instead of !~ /data/ I would just check NR>1 -- by convention the header is only one line and first. Or you can get away with no check, because a nonnumeric field is treated as zero in arithmetic.
      Commented Jul 16, 2024 at 1:37

    Using csvsql from the csvkit set of tools, which was what you originally tried to use:

    $ csvsql -I --query 'SELECT SUM("data 1") FROM file' file | tail -n +2
    19

    This inserts the CSV data from file into a database table of the same name, without type inference (-I). It then applies the SQL aggregate SUM("data 1") to that table to get the sum of the data 1 field.

    Since the output will contain a header, we strip it off with a call to tail.

    • There's also a --tables option that allows you to pass a "dummy" table name rather than having to hardcode the input filename as table name (which always bugs me)
      Commented Jul 2, 2024 at 10:17
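Following up on that comment, a sketch of the same query using --tables so the table name no longer has to match the filename (t is an arbitrary name; csvsql must be installed). This should again print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# --tables names the table "t" regardless of the input filename
csvsql -I --tables t --query 'SELECT SUM("data 1") FROM t' /tmp/sample.csv | tail -n +2
```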

    Using GNU awk 5.3 or newer for -k (or equivalently --csv):

    awk -k '{n+=$3} END{print n+0}' file 

    See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on parsing CSV with awk.

    • What's the purpose of the n+0?
      – tink
      Commented Jul 15, 2024 at 17:38
    • @tink To ensure that the script produces the number 0 if the input is empty rather than a blank line.
      – Ed Morton
      Commented Jul 15, 2024 at 17:44
    • Ah - thanks - that makes sense.
      – tink
      Commented Jul 15, 2024 at 17:48

    If you're willing to try csvkit again, it's possible in two simple steps: the first extracts just your data 1 column, and the second outputs the sum of that column to the command line.

    csvcut -c "data 1" file | csvstat --sum 

    For an enormous file, this won't be as efficient as some of the single step options above.

    • Good thinking! Thanks. There is more to csvkit than meets the eye. 😄
      – halloleo
      Commented Apr 23 at 22:17
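If none of the dedicated CSV tools above happen to be installed, python3's standard csv module handles quoted fields and is usually available; a minimal sketch of the same quote-aware sum driven from the shell (path and column name taken from the sample above):

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# csv.DictReader handles the quoting; sum the named column as floats
python3 - /tmp/sample.csv 'data 1' <<'PY'
import csv, sys

path, col = sys.argv[1], sys.argv[2]
with open(path, newline='') as f:
    print(sum(float(row[col]) for row in csv.DictReader(f)))
PY
# prints 19.0
```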
