I have a lengthy text file; partial file content is shown below:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}, 

I need to parse the uniprot IDs from the above text file, and the expected outcome is given below:

P12807 P12807 T12807 P12808 Z12809 P12821 P0C918 

To do this, I have tried the following commands, but nothing works for me:

sed -e 's/"uniprot":"\(.*\)"},{"site":"/\1/' file.txt
cat file.txt | sed 's/.*"uniprot":" //' | sed 's/"site":".*$//'

Kindly help me to parse the ids as mentioned above.

Thanks in advance.

  • When dealing with structured data (like XML or JSON), it is best to use a dedicated parser like xmlstarlet or jq, since these don't rely on a specific layout of the data.
    – AdminBee
    Commented Jun 15, 2021 at 10:54
  • I think you should accept the other answer instead of mine.
    – mattb
    Commented Jun 15, 2021 at 11:04

6 Answers

13

If you're on a Linux system, you can very easily do:

$ grep -oP '"uniprot":"\K[^"]+' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

The -o tells grep to only print the matching portion of each line and the -P enables Perl Compatible Regular Expressions. The regex looks for "uniprot":" but then discards it (the \K means "discard anything matched so far", so that it isn't included in the output). Then, you just match the longest stretch of non-" characters ([^"]+).
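The \K trick can also be written as a lookbehind assertion, in case that form is more familiar. A small sketch, using a hypothetical file sample.txt trimmed down to the same shape as the question's data:

```shell
# A trimmed-down sample in the same shape as the question's file
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > sample.txt

# (?<=...) asserts that the prefix precedes the match without including it in the output
grep -oP '(?<="uniprot":")[^"]+' sample.txt
```

One practical difference: \K also works after variable-length patterns, while PCRE lookbehinds traditionally must be fixed-length.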


Of course, this looks like JSON data so for anything more complicated, you should use a proper parser for it like jq. If you fix your file by adding a closing ] and make it like this:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}] 

You can do:

$ jq -r '.[].uniprot' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
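If you'd rather not hand-edit the file, the missing ] can be supplied on the fly before querying with jq. A sketch, assuming the file's single line ends with a dangling comma as shown in the question (partial.txt and fixed.json are hypothetical names):

```shell
# Truncated sample ending in a comma, as in the question
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > partial.txt

# Swap the trailing comma on the last line for the missing closing bracket
sed '$s/,[[:space:]]*$/]/' partial.txt > fixed.json

# Now it is valid JSON and jq can query it
jq -r '.[].uniprot' fixed.json
```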
  • I have an Ubuntu 18.04 LTS system. Your command serves my purpose perfectly. Thank you for your detailed explanation of the syntax.
    – Kumar
    Commented Jun 15, 2021 at 10:54
  • I like your answer better - I didn't know about the Perl extension for GNU grep! Thanks.
    – mattb
    Commented Jun 15, 2021 at 10:56
  • If any of the entries lack a uniprot key, you'd get null values. To skip these, use .[].uniprot // empty.
    – Kusalananda
    Commented Jun 15, 2021 at 11:09
  • +1 for the jq answer, since that tool should be installed anyway.
    Commented Jun 15, 2021 at 22:13
2

If you look closely, your input file is almost a valid Python data structure - in particular, a list of dictionaries. We only need to append a closing square bracket.

By means of the ast module we can then evaluate the string as a Python literal.

python3 -c 'import sys, ast
ifile, key = sys.argv[1:]
text = ""
with open(ifile) as fh:
    for l in fh:
        text += l.rstrip()
lod = ast.literal_eval(text)
for d in lod:
    print(d[key])
' file uniprot

P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
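Since the repaired file is also valid JSON, the same extraction works with Python's standard json module and needs no manual line-joining. A sketch with a minimal hypothetical sample, complete.json, with field names as in the question's data:

```shell
# Sample with the closing bracket already restored
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"}]' > complete.json

# json.load parses the whole file; then print the requested key of each dict
python3 -c 'import sys, json
for d in json.load(open(sys.argv[1])):
    print(d[sys.argv[2]])
' complete.json uniprot
```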
2

Using gawk:

awk 'BEGIN{RS=","} /uniprot/{print gensub(/.*("uniprot":")(.*)".*/, "\\2", "g") }' input

In this command, the input record separator (RS) is set to a comma.

Then the gawk built-in function gensub() replaces each record with the desired part of the pattern, using the backreference \\2.
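gensub() is a GNU awk extension; for a POSIX awk, roughly the same record-splitting idea can be sketched with index() and substr() instead (recs.txt is a hypothetical sample trimmed down from the question's data):

```shell
# Trimmed-down sample in the question's shape
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > recs.txt

# Split records on commas; in records containing the key, slice out the value
awk 'BEGIN { RS = "," }
/"uniprot"/ {
    s = substr($0, index($0, "\"uniprot\":\"") + 11)  # skip past the 11-char prefix "uniprot":"
    sub(/".*/, "", s)                                 # drop the closing quote onward
    print s
}' recs.txt
```

Like the gensub version, this assumes commas split records cleanly; a comma inside a value (e.g. "Copper-bind,SoxE") is harmless here only because that fragment never contains "uniprot".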

2

This will do it for your example (with grep and sed):

grep -o '"uniprot":"[^"]*"' your_file | sed 's/.*:"\(.*\)"/\1/'

The way it works is as follows:

• First we print only the matching parts of a grep search to get:

  "uniprot":"P12807"
  "uniprot":"P12807"
  "uniprot":"T12807"
  "uniprot":"P12808"
  "uniprot":"Z12809"
  "uniprot":"P12821"
  "uniprot":"P0C918"

• Then we pipe that to sed and use a capture group to remember the quoted value, replacing each line with only that value to get:

  P12807
  P12807
  T12807
  P12808
  Z12809
  P12821
  P0C918
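Since every match produced by the grep step has the same quote layout, the sed step can also be swapped for cut: the value is the fourth "-delimited field. A sketch on a hypothetical trimmed-down sample, mini.txt:

```shell
# Trimmed-down sample in the question's shape
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > mini.txt

# Each match looks like "uniprot":"P12807"; with " as the delimiter, field 4 is the value
grep -o '"uniprot":"[^"]*"' mini.txt | cut -d'"' -f4
```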
  • This would fail if the input data were changed to the equivalent but pretty-printed JSON.
    – Kusalananda
    Commented Jun 15, 2021 at 11:00
  • @Kusalananda this seems to work for pretty-printed JSON, but I don't have the general solution (pretty-printed/non-pretty-printed) yet: cat file_test.txt | jq | grep -o '"uniprot": "[^"]*"' | sed 's/.*: "\(.*\)"/\1/'
    Commented Jun 16, 2021 at 15:37
1

Perl 5 solution:

$ perl -nle 'print join"\n",m/uniprot\":\"(.*?)\"/g' file.txt
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
1

Using Raku (formerly known as Perl_6)

raku -e 'slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ].join( Q:b[\n] ).put;'

OR

raku -e '.put for slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ];'

Sample Input:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}]

Sample Output:

P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

Briefly:

1. the file is slurped into Raku,
2. comb is used to select out curly-brace wrapped elements,
3. comb is again used to select out double-quoted elements,
4. in the selected elements, double-quotes are removed,
5. an index starting from 0 is generated, taking every sixth element (i.e. the uniprot_ID), and
6. first example: uniprot_IDs are joined by \n and output with put.

I've tried to remove quoting characters as much as possible, to increase portability. Most importantly, the code above still works even if jq, or jq -r, or jq -c is run on the input file prior to feeding it into Raku, e.g.:

cat uniprot_file.txt | jq -c | raku -e '...'

Addendum: It appears that Raku's JSON::Tiny module can parse a cleaned-up version of the OP's input file (with terminal characters corrected), but it requires a little extra work:

1. list-ifying the result of the from-json() call, and
2. split-ting on ", " (comma-space) to recover the hash objects.

~$ raku -MJSON::Tiny -e 'my @a = from-json($_).list given slurp; .<uniprot>.put for @a>>.split(", ");' uniprot_test.txt
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

https://forum.codeselfstudy.com/t/how-to-use-json-in-raku/2519
https://raku.org/

  • another option would be: raku -ne '.say for .comb: /"\"uniprot\":\""<(\w+)>"\""/' file.txt
    Commented Jun 16, 2021 at 7:33
  • With better quotes: raku -ne '.say for .comb: /\"uniprot\"\:\"<(\w+)>\"/' bla.json
    Commented Jun 16, 2021 at 7:40
  • It seems jq would not work with that data; it's not valid JSON. But here's another version that will work even if the input is well formatted: raku -e '.say for slurp.comb: /\"uniprot\"<.ws>\:<.ws>\"<(\w+)>\"/' ble.json Another option that would work for that "broken" data (I still need to work on its speed and make it easier to use) is using JSON::Stream like this: raku -MJSON::Stream -e 'react whenever json-stream(@*ARGS.head.IO.words.Supply, [[q"$", *, "uniprot"],]) { .say }' bla.json
    Commented Jun 17, 2021 at 16:05
  • I like the JSON::Stream option because, even with input that is not valid JSON, you can use JSON-path-like indexes to get at it.
    Commented Jun 17, 2021 at 16:10
  • With the new version of JSON::Stream (github.com/FCO/JSON-Stream), it became easier to set the JSON-path: raku -MJSON::Stream -e 'react whenever json-stream "bla.json".IO.open.Supply, <$.*.uniprot> { .say }'
    Commented Jun 17, 2021 at 19:15
