
I am looking for a command-line tool to calculate the sum of the values in a specified column of a CSV file. (Update: The CSV file might have quoted fields, so a simple approach that just splits on the delimiter (',') does not work.)

Given the following sample CSV file:

description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22

I want to compute the sum, for example over the column data 1, with the result 19.

I have tried to use csvkit for this but didn't get very far. Are there other command-line tools specialised in this kind of CSV operation?

  • This question is similar to: Processing tabular data from the command-line. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem.
    – muru
    Commented Jul 2, 2024 at 1:52
  • Not exactly the same, but for the suggestions in the dupe, you just need to drop the groupby stuff for datamash and the reorder parts for miller (so, e.g., datamash --header-in -st , sum 3, mlr --csv -N stats1 -a sum -f 3)
    – muru
    Commented Jul 2, 2024 at 1:56
  • @muru Thanks for the link to the other question. I missed saying that the CSV file might have quoted fields. I updated the question (incl. sample CSV) to make that clear.
    – halloleo
    Commented Jul 2, 2024 at 3:58
  • @muru Your 2nd comment is great. Would you mind making this into an answer?
    – halloleo
    Commented Jul 2, 2024 at 4:01

5 Answers


Miller handles quoted CSVs natively, so the following should work:

mlr --csv --headerless-csv-output stats1 -a sum -f 'data 1' 
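For example, with the sample from the question saved to a file and passed as an argument (the /tmp/sample.csv path is just for illustration; mlr must be installed), the whole run should print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# Sum the "data 1" column; the quoted "banana,apple" field is parsed correctly
mlr --csv --headerless-csv-output stats1 -a sum -f 'data 1' /tmp/sample.csv
```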

    I put your CSV into a file a.csv and did the summing as follows:

    cat a.csv | awk -F, '{ if($3 !~ /data*/ ){ total+= $3} } END { print "sum=" total}' 

    This uses awk with a comma as the field separator. If field #3 does not match /data*/ (which skips the header row), its value is added to total; at the end, "sum=" and the total are printed. This is a subset of the Processing Tabular Data awk answer.
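To see why the quoted fields from the updated question break this, you can run the same comma-split logic against the sample data: the embedded commas shift the fields, so on those rows $3 is a non-numeric fragment that awk treats as 0, and the sum comes out as 7 instead of 19. A self-contained sketch (NR > 1 is used here to skip the header; the path is illustrative):

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# Naive comma splitting: $3 is apple" / 7 / meat" on the data rows
awk -F, 'NR > 1 { total += $3 } END { print "sum=" total }' /tmp/sample.csv
# prints sum=7, not sum=19
```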

    An updated version that uses FPAT to handle the quoted fields:

    cat a.csv | awk -v FPAT='([^,]*)|("[^"]+")' '{ if($3 !~ /data*/ ){ total+= $3 } } END {print "sum="total}' 

    FPAT is a gawk extension (it won't work in old-style awk) that tells awk what a field looks like, rather than what the separator looks like. Here the pattern has two alternatives: ([^,]*) matches a run of zero or more non-comma characters, and ("[^"]+") matches one or more non-quote characters enclosed in double quotes.
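Putting the FPAT version together against the sample file as a self-contained sketch (gawk required; /tmp/sample.csv is an illustrative path, and NR > 1 is used instead of the /data*/ test to skip the header), this should print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# FPAT defines a field as either a run of non-commas or a quoted string,
# so "banana,apple" counts as a single field and $3 is always the number
gawk -v FPAT='([^,]*)|("[^"]+")' 'NR > 1 { total += $3 } END { print total }' /tmp/sample.csv
```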

    • Thanks for the answer. My question was not detailed enough: I missed saying that the CSV file might have quoted fields. Sorry for this, @user1683793. I updated the question (incl. sample CSV) to make that clear.
      – halloleo
      Commented Jul 2, 2024 at 4:00
    • @halloleo, I believe you will find this updated version satisfactory. I had not dealt with quoted strings in awk before. Interesting.
      Commented Jul 2, 2024 at 4:37
    • Instead of !~ /data/ I would just check NR>1 -- by convention the header is only one line and first. Or you can get away with no check, because a nonnumeric field is treated as zero in arithmetic.
      Commented Jul 16, 2024 at 1:37

    Using csvsql from the csvkit set of tools, which was what you originally tried to use:

    $ csvsql -I --query 'SELECT SUM("data 1") FROM file' file | tail -n +2
    19

    This inserts the CSV data from file into a database table of the same name, without type inference (-I). It then applies the SQL aggregate SUM("data 1") to that table to get the sum of the data 1 field.

    Since the output will contain a header, we strip it off with a call to tail.

    • There's also a --tables option that allows you to pass a "dummy" table name rather than having to hardcode the input filename as table name (which always bugs me)
      Commented Jul 2, 2024 at 10:17
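Following up on that comment, a sketch of the same query using --tables so the table name no longer has to match the filename (t is an arbitrary name; csvsql must be installed). This should again print 19:

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# --tables names the table "t" regardless of the input filename
csvsql -I --tables t --query 'SELECT SUM("data 1") FROM t' /tmp/sample.csv | tail -n +2
```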

    Using GNU awk 5.3 or newer for -k (or equivalently --csv):

    awk -k '{n+=$3} END{print n+0}' file 

    See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on parsing CSV with awk.

    • What's the purpose of the n+0?
      – tink
      Commented Jul 15, 2024 at 17:38
    • @tink To ensure that the script produces the number 0 if the input is empty rather than a blank line.
      – Ed Morton
      Commented Jul 15, 2024 at 17:44
    • Ah - thanks - that makes sense.
      – tink
      Commented Jul 15, 2024 at 17:48

    If you're willing to try csvkit again, it's possible in two simple steps: the first extracts just your data 1 column, and the second outputs the sum of that column to the command line.

    csvcut -c "data 1" file | csvstat --sum 

    For an enormous file, this won't be as efficient as some of the single step options above.

    • Good thinking! Thanks. There is more to csvkit than meets the eye. 😄
      – halloleo
      Commented Apr 23 at 22:17
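If none of the dedicated CSV tools above happen to be installed, python3's standard csv module handles quoted fields and is usually available; a minimal sketch of the same quote-aware sum driven from the shell (path and column name taken from the sample above):

```shell
# Recreate the sample CSV from the question
cat > /tmp/sample.csv <<'EOF'
description A,description B,data 1,data 2
fruit,"banana,apple",3,17
veggie,cauliflower,7,18
animal,"fish,meat",9,22
EOF

# csv.DictReader handles the quoting; sum the named column as floats
python3 - /tmp/sample.csv 'data 1' <<'PY'
import csv, sys

path, col = sys.argv[1], sys.argv[2]
with open(path, newline='') as f:
    print(sum(float(row[col]) for row in csv.DictReader(f)))
PY
# prints 19.0
```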
