
I'd like help writing a sed command to replace the IDs in one file with names from a different file. I have a file that looks like this:

a b nSites J9
0 1 3092845 1
0 2 3139733 1
0 3 3339810 1
0 4 3124263 1

The first two columns, a and b, are the IDs. Each number corresponds to a sample; the samples are numbered 0-99. I need a sed command to replace all the numbers in the first two columns with names from a name file. The name file looks like this:

SF10 0
SF11 1
SF12 2
SF13 3
SF14 4

I really need a command to automate this, so I don't have to change each number by hand.

  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer.
    – CommunityBot
    Commented Oct 15, 2024 at 22:30
  • Please show the raw data, or at least a small example set, and the corresponding required output. Ensure these can be copied and pasted, so not as pictures. You've said nothing about how the columns of numbers are separated, so be sure to mention that when you provide your example. Finally, please show what code you've already tried.
    Commented Oct 16, 2024 at 0:09
  • Could you be more explicit about your "name file"? Is it a one-to-one mapping of a unique alphanumeric value (first column) to a unique numeric value (second column)? Thx.
    Commented Oct 16, 2024 at 11:51
  • sed would be a bad choice for this, but it'd be trivial in awk (another mandatory POSIX tool available on all Unix boxes, just like sed is). Does the solution need to use sed?
    – Ed Morton
    Commented Oct 16, 2024 at 12:00

5 Answers


I'm assuming that both input files are tab-delimited, and that the first file contains a header but that the second one (with the mapping between names and IDs) does not.

Using Miller (mlr) we can then perform two relational JOIN operations to map the names from the name list to the data:

$ mlr --tsv --from data.tsv join --implicit-tsv-header -f names.tsv -j 2 -r b --lp b then join --implicit-tsv-header -f names.tsv -j 2 -r a --lp a then cut -x -f 2 then rename a1,a,b1,b
a	b	nSites	J9
SF10	SF11	3092845	1
SF10	SF12	3139733	1
SF10	SF13	3339810	1
SF10	SF14	3124263	1

The command by itself:

mlr --tsv --from data.tsv \
  join --implicit-tsv-header -f names.tsv -j 2 -r b --lp b then \
  join --implicit-tsv-header -f names.tsv -j 2 -r a --lp a then \
  cut -x -f 2 then \
  rename a1,a,b1,b

This first joins names.tsv and its 2nd field with data.tsv and its b field. The second join is similar but joins the 2nd field with the a field. Each join uses --lp to prepend a b (1st join) and an a (2nd join) to the non-join fields of the names.tsv file (i.e. only the 1st field, yielding fields called a1 and b1). This is done to avoid overwriting the 1 field from the first join with that from the second join.

The cut and rename operations remove the unneeded 2 field and change the prefixed 1 fields a1 and b1 to a and b.

Note that the join verb of Miller, in contrast to the standard Unix utility with the same name, does not require the join fields to be sorted. If the fields are sorted (lexically like in standalone join, not numerically), you may possibly speed up the operation slightly by adding the -s option (or --sorted-input) to the subcommand.


    sed would not be the first tool that comes to mind for this kind of task. There are dedicated tools such as mlr for working with tabular formats such as your TSVs, as @Kusalananda has shown.

    But it is possible in sed with BREs: append the mapping to the end of the pattern space, then look up each term to replace in that mapping using a back reference to a capture of that term.

    printf 'PROCESSED\nNAMES\n' |
      sed '
        :slurp
        $! {
          N
          b slurp
        }
        :A_loop
        # 1 2 3 4 5 6 7 8 9
        s/\(.*\n\)\([^[:space:]]\{1,\}\)\([[:blank:]].*\n\)\(PROCESSED\n\)\(\(.*\n\)\{0,1\}NAMES\n\(.*\n\)\{0,1\}\([^[:space:]]\{1,\}\)[[:blank:]]\{1,\}\2\(\n.*\)\{0,1\}\)$/\1\4\8\3\5/
        t A_loop
        /\n.*\nPROCESSED\n/ {
          s/.*/Could not map all "a" fields/
          q
        }
        s/\(\nPROCESSED\)\(\n.*\)\(\nNAMES\n\)/\2\1\3/
        :B_loop
        # 1 2 3 4 5 6 7 8 9
        s/\(.*\n\)\([^[:space:]]\{1,\}[[:blank:]]\{1,\}\)\([^[:space:]]\{1,\}\)\([[:blank:]].*\n\)\(PROCESSED\n\)\(\(.*\n\)\{0,1\}NAMES\n\(.*\n\)\{0,1\}\([^[:space:]]\{1,\}\)[[:blank:]]\{1,\}\3\(\n.*\)\{0,1\}\)$/\1\5\2\9\4\6/
        t B_loop
        /\n.*\nPROCESSED/ {
          s/.*/Could not map all "b" fields/
          q
        }
        s/\nPROCESSED\(\n\)/\1/
        s/\nNAMES\n.*//' file.tsv - name.tsv

    That assumes the whole input (both files plus the PROCESSED and NAMES delimiters) is small enough to fit in sed's pattern space.
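    The core trick can be isolated in a minimal, self-contained sketch. The token to translate (`2`), the `MAP` marker, and the table contents below are all invented here for illustration; the idea is the same: hold the token plus a lookup table in the pattern space, capture the token, and let a back reference find its partner in the table.

```shell
# Pattern space after N: "2\nMAP SF12 2 SF13 3".
# \1 captures the leading token "2"; \3 captures whatever name
# precedes " 2" in the appended table; the final s strips the table.
printf '2\nMAP SF12 2 SF13 3\n' |
sed '
  N
  s/^\([^[:space:]]\{1,\}\)\(\n.*[[:space:]]\)\([^[:space:]]\{1,\}\) \1\([[:space:]].*\)\{0,1\}$/\3\2\3 \1\4/
  s/\n.*//'
# prints: SF12
```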


      Not sure if sed is the right tool, but you can do this using awk. Starting with your sample input files:

      $ cat mainlist
      a b nSites J9
      0 1 3092845 1
      0 2 3139733 1
      0 3 3339810 1
      0 4 3124263 1
      $ cat keyvals
      SF10 0
      SF11 1
      SF12 2
      SF13 3
      SF14 4

      Then:

      $ awk 'FNR==NR{a[$2]=$1; next}; {$1=(a[$1]=="" ? $1 : a[$1]); $2=(a[$2]=="" ? $2 : a[$2]); print}' keyvals mainlist
      a b nSites J9
      SF10 SF11 3092845 1
      SF10 SF12 3139733 1
      SF10 SF13 3339810 1
      SF10 SF14 3124263 1

      (the format has changed but you can easily get what you want by changing the print format)

      The keyvals file is read in the first pass to create a lookup dictionary in the a[] array. On the second pass the main file is read, and the first and second fields of each line are substituted (only if they appear in the dictionary).

      This should work even if you have multiple entries anywhere in your mainlist that do not appear in the dictionary (keyvals file).

      • a[$1]=="" will add an entry to a[] for every $1 from mainlist and so use up memory unnecessarily, which may be a concern if that file is large. If you use $1 in a instead then you won't have that issue, and it'll execute faster, as then awk's just doing a hash lookup instead of a hash lookup plus a string comparison. Ditto for $2, obviously. Swap the order of the ternary operands after making that change, of course.
        – Ed Morton
        Commented Oct 16, 2024 at 15:38
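      Applying that comment to the answer above, the revised one-liner might look like the following sketch (the keyvals and mainlist files are recreated inline, with a shortened sample, so it runs as-is):

```shell
# Recreate the sample files named in the answer (shortened for the sketch).
printf 'a b nSites J9\n0 1 3092845 1\n0 2 3139733 1\n' > mainlist
printf 'SF10 0\nSF11 1\nSF12 2\n' > keyvals

# Same two-pass approach, but test key membership with `in`:
# a plain hash lookup, and no empty entries created for unmapped values.
awk 'FNR==NR { a[$2]=$1; next }
     { $1 = ($1 in a ? a[$1] : $1)
       $2 = ($2 in a ? a[$2] : $2)
       print }' keyvals mainlist
```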

      I see in the edit history of the question that an earlier incarnation showed sample input as screenshots of what appear to be Excel spreadsheets, and someone else replaced those with the current space-separated textual example. Given that, I'm going to assume the input will actually be comma-separated, since that's the most common Excel export/import format, in which case it'd look like:

      $ head file{1,2}
      ==> file1 <==
      a,b,nSites,J9
      0,1,3092845,1
      0,2,3139733,1
      0,3,3339810,1
      0,4,3124263,1

      ==> file2 <==
      SF10,0
      SF11,1
      SF12,2
      SF13,3
      SF14,4

      With the above input, using any awk:

      $ awk 'BEGIN{FS=OFS=","} NR==FNR{map[$2]=$1; next} FNR>1{$1=map[$1]; $2=map[$2]} 1' file2 file1
      a,b,nSites,J9
      SF10,SF11,3092845,1
      SF10,SF12,3139733,1
      SF10,SF13,3339810,1
      SF10,SF14,3124263,1

      If the input truly is space-separated (tabs and/or blanks), though, then still using any awk:

      $ awk 'NR==FNR{map[$2]=$1; next} FNR>1{$1=map[$1]; $2=map[$2]} 1' file2 file1
      a b nSites J9
      SF10 SF11 3092845 1
      SF10 SF12 3139733 1
      SF10 SF13 3339810 1
      SF10 SF14 3124263 1

      To get the output to look tabular there's various approaches including just piping to column -t:

      $ awk 'NR==FNR{map[$2]=$1; next} FNR>1{$1=map[$1]; $2=map[$2]} 1' file2 file1 | column -t
      a     b     nSites   J9
      SF10  SF11  3092845  1
      SF10  SF12  3139733  1
      SF10  SF13  3339810  1
      SF10  SF14  3124263  1

      or making the output tab-separated instead of blank-separated:

      $ awk -v OFS='\t' 'NR==FNR{map[$2]=$1; next} FNR>1{$1=map[$1]; $2=map[$2]} {$1=$1} 1' file2 file1
      a	b	nSites	J9
      SF10	SF11	3092845	1
      SF10	SF12	3139733	1
      SF10	SF13	3339810	1
      SF10	SF14	3124263	1

      or using printf:

      $ awk 'NR==FNR{map[$2]=$1; next} FNR>1{$1=map[$1]; $2=map[$2]} {printf "%-10s %-10s %-10s %-10s\n", $1, $2, $3, $4}' file2 file1
      a          b          nSites     J9
      SF10       SF11       3092845    1
      SF10       SF12       3139733    1
      SF10       SF13       3339810    1
      SF10       SF14       3124263    1
      • The first revision was a TSV (which you can see in the source; annoyingly, the Stack Exchange markdown renderer changes the tabs to spaces, and even more annoyingly with tab stops every 4 columns instead of the usual 8). Then the OP changed to a visual representation in spreadsheet software, possibly because they hadn't figured out how to use code blocks. In any case, mlr can work with CSV or TSV (and several other formats). Commented Oct 16, 2024 at 15:42
      • You can test and see that if file1 includes lines whose first/second columns do not appear in file2, then the output will be corrupted.
        – userene
        Commented Oct 17, 2024 at 10:17
      • @userene the OP hasn't shown that in their example so a) it probably can't happen and b) if it can happen they haven't shown how they want it to be handled (leave the existing value alone, replace it with N/A, delete the row, replace it with something else, prefix it with some string, print a warning to stderr, stop processing, etc., etc.). So anything we do in that situation is probably unnecessary and we don't know what might be considered correct behavior.
        – Ed Morton
        Commented Oct 17, 2024 at 10:24
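      For illustration, the first option listed in that comment (leave an unmapped value alone) could be sketched as below. The files mirror file1/file2 from the answer, except that the 99 row is a hypothetical addition whose first column has no mapping:

```shell
# file1/file2 as in the answer, plus one row with an unmapped ID (99).
printf 'a,b,nSites,J9\n0,1,3092845,1\n99,2,5,1\n' > file1
printf 'SF10,0\nSF11,1\nSF12,2\n' > file2

# Guard each substitution with `in` so IDs missing from the map pass
# through unchanged instead of becoming empty fields.
awk 'BEGIN{FS=OFS=","} NR==FNR{map[$2]=$1; next}
     FNR>1{ if ($1 in map) $1=map[$1]; if ($2 in map) $2=map[$2] } 1' file2 file1
```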

      Using Raku (formerly known as Perl_6)

      ~$ raku -e 'my %names;    #declare hash to map names (key/value pairs)
                  for "/path/to/names.txt".IO.lines() { %names.push: .split(/ \s+ /).reverse };
                  #take file off command line, output header line
                  lines[0].split(/\s+/).join("\t").put;
                  #take file off command line, output data rows
                  for lines() { .split(/ \s+ /) andthen put join "\t", %names{.[0]}, %names{.[1]}, .[2..*] };' target.txt

      OR:

      ~$ raku -e 'my %names;    #declare hash to map names (key/value pairs)
                  for "/path/to/names.txt".IO.lines() { %names.push: .split(/ \s+ /).reverse };
                  #take file off command line, output header/data using && || operators
                  for lines() { my $ln = .split(/ \s+ /);
                                put $++ == 0 && $ln.join("\t") || join "\t", %names{$ln.[0]}, %names{$ln.[1]}, $ln.[2..*] };' target.txt

      Here's an answer written in Raku, a member of the Perl-family of programming languages. This answer assumes that the names.txt file contains two columns with unique one-to-one (bijective) pairs.

      • A %names hash is declared. For the names.txt file the path is taken as a string, converted to an IO object, then lines are iterated through. Each line is split on whitespace, reversed, and the key-value pair pushed onto the %names hash.

      • Now, taking the target.txt file off the command line, lines are read in. In the first example, the header line (lines[0]) is split/joined, and output. In the second example, an anonymous $++ counter is used in conjunction with && and || short-circuiting operators to accomplish the same task: output the header line properly formatted.

      • In the final join "\t" clause, the input column(s) are decoded using the %names hash (e.g. %names{.[0]} / %names{$ln.[0]}, etc.), appropriately joined and output.

      Sample Input:

      ~$ cat names.txt
      SF10 0
      SF11 1
      SF12 2
      SF13 3
      SF14 4
      ~$ cat target.txt
      a b nSites J9
      0 1 3092845 1
      0 2 3139733 1
      0 3 3339810 1
      0 4 3124263 1

      Sample Output:

      a	b	nSites	J9
      SF10	SF11	3092845	1
      SF10	SF12	3139733	1
      SF10	SF13	3339810	1
      SF10	SF14	3124263	1

      https://docs.raku.org
      https://raku.org
