
I'm dealing with a series of bed files, which look like this:

chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5

With the columns being chromosome, start, end, score. I have multiple different files with different scores in each one, and I'd like to combine them like this:

> cat a.bed
chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5
> cat b.bed
chr1 100 110 0.4
chr1 150 175 0.7
chr1 200 300 0.9
> cat c.bed
chr1 100 110 1.5
chr1 150 175 1.2
chr1 200 300 0.1
> cat combined.bed
chr1 100 110 0.5 0.4 1.5
chr1 150 175 0.2 0.7 1.2
chr1 200 300 1.5 0.9 0.1

All the score columns (the last column of each file) are collected into a single file. I found this answer, which can combine a column from one additional file into an existing file, but I would like a command that can handle a variable number of files. So if I have 10 bed files to combine, I'd like a command that can process them all together and create a single file with 10 score columns.

Each file should have the same number of lines, and each entry should have the same coordinates in all the files, so there should be no conflicts there. However, there can be a lot of entries in each of the files (generally 100K or more), so I'd like to avoid processing each one multiple times.

Is there a way to handle this cleanly? This will be in a script, so it doesn't need to be a one-liner.

  • How are the files ordered? Your output seems to indicate the column output of a.bed should follow b.bed? Are those sample file names?
    – Inian
    Commented Feb 24 at 15:14
  • @Inian order is more or less irrelevant. It would be nice to get the names of the files used in a header, users can then refer to this header to determine where each signal is coming from.
    – Whitehot
    Commented Feb 24 at 15:52
  • @Whitehot Are the input files always ordered in the same way, with the same number of lines in each file? I.e., would we be able to combine record "n" in one file with the same record "n" in another file by line number, always?
    – Kusalananda
    Commented Feb 24 at 16:15
  • @Kusalananda yes, the entries in each file should all be ordered the same
    – Whitehot
    Commented Feb 24 at 17:18

3 Answers


Using any paste and any awk, assuming all files have the same key values and there are no empty files:

$ paste *.bed | awk -v ORS= '{print $1, $2, $3; for (i=4; i<=NF; i+=4) print OFS $i; print RS}'
chr1 100 110 0.5 0.4 1.5
chr1 150 175 0.2 0.7 1.2
chr1 200 300 1.5 0.9 0.1

If you want a header line for that you could run this first:

$ printf 'Chr Start Stop'; printf ' %s' *.bed; printf '\n'
Chr Start Stop a.bed b.bed c.bed
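
In a script you could group the header line and the paste/awk pipeline and redirect both into a single file; a minimal sketch, assuming the inputs are the *.bed files in the current directory (so the header order matches the paste order) and combined.bed as the output name:

{
    printf 'Chr Start Stop'
    printf ' %s' *.bed
    printf '\n'
    paste *.bed | awk -v ORS= '{print $1, $2, $3; for (i=4; i<=NF; i+=4) print OFS $i; print RS}'
} > combined.bed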

Alternatively, here is a solution using just any awk which works even if you don't always have the same keys across every file and even if some files are empty, and which adds a header:

$ cat tst.awk
{ key = $1 OFS $2 OFS $3 }
!seen[key]++ { keys[++numKeys] = key }
{ vals[key,FILENAME] = $4 }
END {
    numFiles = ARGC - 1
    printf "Chr" OFS "Start" OFS "Stop"
    for ( fileNr=1; fileNr<=numFiles; fileNr++ ) {
        fileName = ARGV[fileNr]
        printf "%s%s", OFS, fileName
    }
    print ""
    for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
        key = keys[keyNr]
        printf "%s", key
        for ( fileNr=1; fileNr<=numFiles; fileNr++ ) {
            fileName = ARGV[fileNr]
            val = vals[key,fileName]
            printf "%s%s", OFS, val
        }
        print ""
    }
}

$ awk -f tst.awk a.bed b.bed c.bed
Chr Start Stop a.bed b.bed c.bed
chr1 100 110 0.5 0.4 1.5
chr1 150 175 0.2 0.7 1.2
chr1 200 300 1.5 0.9 0.1

Piping the output to column -t would be one of many alternatives if you want columnar output, e.g.:

$ awk -f tst.awk a.bed b.bed c.bed | column -t
Chr   Start  Stop  a.bed  b.bed  c.bed
chr1  100    110   0.5    0.4    1.5
chr1  150    175   0.2    0.7    1.2
chr1  200    300   1.5    0.9    0.1
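
Since tst.awk builds its file list from ARGV, it handles however many inputs you pass it; for 10 files you could, for example, run something like:

$ awk -f tst.awk *.bed > combined.bed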

    The following uses Miller (mlr) to rename the first three nameless fields to chr, start, and stop, and then adds an extra field called file holding the name of the file that the current record was read from. It then applies a reshape "long-to-wide" operation to pivot the dataset, using the newly added file field as the name field for the value in the 4th (still nameless) field.

    mlr --nidx \
      label chr,start,stop then \
      put '$file = FILENAME' then \
      reshape -s file,4 \
      [abc].bed

    The input is assumed to be space-separated, as in the question, but it does not have to be sorted. The command generates space-separated output:

    chr1 100 110 0.5 0.4 1.5
    chr1 150 175 0.2 0.7 1.2
    chr1 200 300 1.5 0.9 0.1

    (The label command is actually totally unnecessary unless you want the output in some format that supports headers, as shown below.)
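
    For instance, with the default --nidx output the same space-separated result should be produced by just (a sketch omitting label):

    $ mlr --nidx put '$file = FILENAME' then reshape -s file,4 [abc].bed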

    For pretty-printed output, use --n2p in place of --nidx:

    $ mlr --n2p label chr,start,stop then put '$file = FILENAME' then reshape -s file,4 [abc].bed
    chr  start stop a.bed b.bed c.bed
    chr1 100   110  0.5   0.4   1.5
    chr1 150   175  0.2   0.7   1.2
    chr1 200   300  1.5   0.9   0.1

    Use --n2c for CSV output.

    $ mlr --n2c label chr,start,stop then put '$file = FILENAME' then reshape -s file,4 [abc].bed
    chr,start,stop,a.bed,b.bed,c.bed
    chr1,100,110,0.5,0.4,1.5
    chr1,150,175,0.2,0.7,1.2
    chr1,200,300,1.5,0.9,0.1
    • Miller seems like a really cool tool, I'll definitely be investigating it more as I tend to process a lot of csv-like files at work.
      – Whitehot
      Commented Feb 25 at 10:21

    Using Raku (formerly known as Perl_6)

    Create a hash with Chr/start/stop as key, push on value:

    ~$ raku -ne 'BEGIN my %hash; %hash.push: .split(/ \s+ <?before <[.0..9-]>+ $ > /, 2); END .say for %hash.sort;' *.bed
    chr1 100 110 => [0.5 0.4 1.5]
    chr1 150 175 => [0.2 0.7 1.2]
    chr1 200 300 => [1.5 0.9 0.1]

    Clean up output (using put instead of say gives tab-separated key/value pairs):

    ~$ raku -ne 'BEGIN my %hash; %hash.push: .split(/ \s+ <?before <[.0..9-]>+ $ > /, 2); END .put for %hash.sort;' *.bed
    chr1 100 110 0.5 0.4 1.5
    chr1 150 175 0.2 0.7 1.2
    chr1 200 300 1.5 0.9 0.1

    Same as above, except join key/value pairs on whitespace:

    ~$ raku -ne 'BEGIN my %hash; %hash.push: .split(/ \s+ <?before <[.0..9-]>+ $ > /, 2); END .kv.join(" ").put for %hash.sort;' *.bed
    chr1 100 110 0.5 0.4 1.5
    chr1 150 175 0.2 0.7 1.2
    chr1 200 300 1.5 0.9 0.1

    Raku is a programming language in the Perl family that features a clean Regex syntax. Above, .split is used to cleave each input line into two fragments, deleting the whitespace before the final column. These are pushed onto a hash, so the values accumulate. On output, .sort is added at the end of the code (it sorts on keys by default).

    Basically, the .split call looks for the pattern / \s+ <?before <[.0..9-]>+ $ > /, which is whitespace before any combination of the decimal digits 0..9, the . dot, and the - minus sign at the $ end of the line (omit the - minus sign if negative numbers are not possible in this column). Technically, the <?before … > syntax is Raku's notation for a zero-width positive lookahead ("X before Y"). Another approach is shown below.

    Split off the last column without using regexes:

    ~$ raku -ne 'BEGIN my %hash; %hash.push: .flip.split(/ \s+ /, 2).reverse.map: *.flip; END .kv.join(" ").put for %hash.sort;' *.bed
    chr1 100 110 0.5 0.4 1.5
    chr1 150 175 0.2 0.7 1.2
    chr1 200 300 1.5 0.9 0.1

    EDIT: Add a header row with the file names as the value columns:

    ~$ raku -ne 'BEGIN my (%fh, %hash); %hash.push: "Chr start stop" => $*ARGFILES.Str unless %fh{$*ARGFILES.Str}++; %hash.push: .split(/ \s+ <?before <[.0..9-]>+ $ > /, 2); END .kv.join(" ").put for %hash.sort;' *.bed 

    Returns:

    Chr start stop a.bed b.bed c.bed
    chr1 100 110 0.5 0.4 1.5
    chr1 150 175 0.2 0.7 1.2
    chr1 200 300 1.5 0.9 0.1

    https://docs.raku.org/language/regexes
    https://docs.raku.org/type/Str#routine_split
    https://raku.org

    • How memory intensive is storing everything as a hash? The input files should be ordered the same, so it seems like overkill for this use case.
      – Whitehot
      Commented Feb 25 at 10:22
    • @Whitehot what are your memory requirements/limitations? The Raku answer above is less memory-intensive than storing data in an array, and (unless I'm mistaken) I don't see another answer that's less memory intensive.
      Commented Feb 25 at 15:27
    • Perhaps I'm just mistaken about how memory usage in Perl works, but I thought a hash was generally more memory intensive than an array. I expect a few hundred thousand entries in each text file, so nothing a modern laptop can't handle in any case, but I was just curious :)
      – Whitehot
      Commented Feb 25 at 16:17
    • Well, now I'm curious. The hash data structure here is actually a String key and an Array value, per row (i.e. Chr start stop) entry. I just assumed the mini-Arrays created (max 10 elements each) would allow greater memory efficiency. However, the computationally intensive part of the task is looking up the key to push on the correct/corresponding score, and this is really where hashes shine. Cheers.
      Commented Feb 25 at 16:51
