0

We have a large tsv file where the data for a single row are splitted into different rows with new line delimiter.

We need to join them together based on the tab counts.

For eg: If suppose the total tab count for a single row is 995 , the data is split in between as follows,

Row Number Tab Count Row 1 660 Row 2 0 Row 3 300 Row 5 20 Row 6 15 Total 995 

N.B The above row split is not consitent and varies.

I want to add the tab counts and once we get 995 as total , need to join the data from different rows into one single row.

We have the below command to join the lines based on new line delimiter.

paste -sd '\n' inputfile > output file 

I want to know,

  1. If we can get the tab counts for different rows
  2. Add the tab counts to get sum of 995
  3. Once the sum is achieved , whichever tab counts were added from those rows , need to be joined into one single row.

Please let me know if this can be achieved using shell script.

Thanks.!

3
  • It sounds like you are trying to repair a file with newlines in their data fields so that each line has a predefined number of columns. Is the data not properly quoted? If it is, it may still be a perfectly valid CSV file with tabs as field delimiters. CSV is allowed to contain newlines within quoted fields and a CSV parser will be able to handle the data.
    – Kusalananda
    CommentedDec 11, 2018 at 6:43
  • This is a TSV file we are working on and the data is not quoted and only separated by tabs and terminating with new line
    – v.rajan
    CommentedDec 11, 2018 at 6:46
  • Sorry, your description doesn't make any sense to me. Have you combined the raw data and the desired result into on file above? And what is the single row then supposed to look like?
    – tink
    CommentedDec 11, 2018 at 7:39

2 Answers 2

1

As always with these type of questions, it would be better to correct the process that creates the data in the first place rather than appending a post-processing stage to the process. Having said that, here's what you can do.

$ cat file 1 2 3 1 2 3 1 2 3 
$ awk -v w=3 -f script.awk file 1 2 3 1 2 3 1 2 3 

This awk script will collect the tab-delimited fields from the input until a preset number of fields have been collected. It will then output these collected fields as its own line before continuing reading from the input.

The number of fields in the output is given by the value of w, which is passed on the command line as shown above. Note that this is the number of fields, not the number of tab characters.

BEGIN { OFS = FS = "\t" } function output_line () { # a function that outputs the nf elements in the array a # separated by OFS (tab) and terminated by ORS (newline) for (j = 1; j < nf; ++j) printf("%s%s", a[j], OFS) printf("%s%s", a[nf], ORS) } { # a: an array of fields that we want to output together # nf: the length of that array # just add each field to the a array for (i = 1; i <= NF; ++i) { a[++nf] = $i # if enough has been read, output the collected data if (nf == w) { output_line() nf = 0 } } } END { # output any data remaining in a if (nf > 0) output_line() } 

This is the same as

tr '\t' '\n' <file | paste - - - 

for my small example. In your case, you could use the awk script above with -v w=996, or you could type the tr+paste command with 996 dashes.

    0

    Would continued reading lines until reaching field count help? From another post:

    awk -F'\t' ' {while (NF<996) {getline X $0 = $0 FS X } } 1 ' file 
    2
    • The issue with this may be that you overshoot. It depends on whether the input data always contains a newline at the true record boundaries or not.
      – Kusalananda
      CommentedDec 11, 2018 at 10:26
    • Absolutely. I take from the question that we have always an exact count match. Additional care needs to be taken if not.
      – RudiC
      CommentedDec 11, 2018 at 10:33

    You must log in to answer this question.

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.