I have a lengthy text file; partial file content is shown below:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}, 

I need to parse the uniprot IDs from the above text file, and the expected outcome is given below:

P12807 P12807 T12807 P12808 Z12809 P12821 P0C918 

To do this, I have tried the following commands, but nothing works for me:

sed -e 's/"uniprot":"\(.*\)"},{"site":"/\1/' file.txt
cat file.txt | sed 's/.*"uniprot":" //' | sed 's/"site":".*$//'

Kindly help me to parse the ids as mentioned above.

Thanks in advance.

  • When dealing with structured data (like XML or JSON), it is best to use a dedicated parser like xmlstarlet or jq, since these don't rely on a specific layout of the data.
    – AdminBee
    Commented Jun 15, 2021 at 10:54
  • I think you should accept the other answer instead of mine.
    – mattb
    Commented Jun 15, 2021 at 11:04

6 Answers

13

If you're on a Linux system, you can very easily do:

$ grep -oP '"uniprot":"\K[^"]+' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

The -o tells grep to only print the matching portion of each line and the -P enables Perl Compatible Regular Expressions. The regex looks for "uniprot":" but then discards it (the \K means "discard anything matched so far", so that it isn't included in the output). Then, you just match the longest stretch of non-" characters ([^"]+).
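The \K trick can also be written as a lookbehind assertion, in case that form is more familiar. A small sketch, using a hypothetical file sample.txt trimmed down to the same shape as the question's data:

```shell
# A trimmed-down sample in the same shape as the question's file
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > sample.txt

# (?<=...) asserts that the prefix precedes the match without including it in the output
grep -oP '(?<="uniprot":")[^"]+' sample.txt
```

One practical difference: \K also works after variable-length patterns, while PCRE lookbehinds traditionally must be fixed-length.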


Of course, this looks like JSON data so for anything more complicated, you should use a proper parser for it like jq. If you fix your file by adding a closing ] and make it like this:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}] 

You can do:

$ jq -r '.[].uniprot' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
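If you'd rather not hand-edit the file, the missing ] can be supplied on the fly before querying with jq. A sketch, assuming the file's single line ends with a dangling comma as shown in the question (partial.txt and fixed.json are hypothetical names):

```shell
# Truncated sample ending in a comma, as in the question
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > partial.txt

# Swap the trailing comma on the last line for the missing closing bracket
sed '$s/,[[:space:]]*$/]/' partial.txt > fixed.json

# Now it is valid JSON and jq can query it
jq -r '.[].uniprot' fixed.json
```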
  • I have an Ubuntu 18.04 LTS system. Your command serves my purpose perfectly. Thank you for your detailed explanation of the syntax.
    – Kumar
    Commented Jun 15, 2021 at 10:54
  • I like your answer better - I didn't know about the Perl extension for GNU grep! Thanks.
    – mattb
    Commented Jun 15, 2021 at 10:56
  • If any of the entries lack a uniprot key, you'd get null values. To skip these, use .[].uniprot // empty.
    – Kusalananda
    Commented Jun 15, 2021 at 11:09
  • +1 for the jq answer, since that tool should be installed anyway.
    Commented Jun 15, 2021 at 22:13
2

If you look closely, your input file is almost a valid Python data structure - in particular, a list of dictionaries. We only need to append a closing square bracket.

By means of the ast module we can then evaluate the string as a Python literal.

python3 -c 'import sys, ast
ifile, key = sys.argv[1:]
text = ""
with open(ifile) as fh:
    for l in fh:
        text += l.rstrip()
lod = ast.literal_eval(text)
for d in lod:
    print(d[key])
' file uniprot

P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
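Since the repaired file is also valid JSON, the same extraction works with Python's standard json module and needs no manual line-joining. A sketch with a minimal hypothetical sample, complete.json, with field names as in the question's data:

```shell
# Sample with the closing bracket already restored
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"}]' > complete.json

# json.load parses the whole file; then print the requested key of each dict
python3 -c 'import sys, json
for d in json.load(open(sys.argv[1])):
    print(d[sys.argv[2]])
' complete.json uniprot
```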
2

Using gawk:

awk 'BEGIN{RS=","} /uniprot/{print gensub(/.*("uniprot":")(.*)".*/, "\\2", "g") }' input

In this command, the input record separator (RS) is set to a comma.

Then the gawk built-in function gensub() replaces each record with the desired part of the pattern, using the backreference \\2.
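gensub() is a GNU awk extension; for a POSIX awk, roughly the same record-splitting idea can be sketched with index() and substr() instead (recs.txt is a hypothetical sample trimmed down from the question's data):

```shell
# Trimmed-down sample in the question's shape
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > recs.txt

# Split records on commas; in records containing the key, slice out the value
awk 'BEGIN { RS = "," }
/"uniprot"/ {
    s = substr($0, index($0, "\"uniprot\":\"") + 11)  # skip past the 11-char prefix "uniprot":"
    sub(/".*/, "", s)                                 # drop the closing quote onward
    print s
}' recs.txt
```

Like the gensub version, this assumes commas split records cleanly; a comma inside a value (e.g. "Copper-bind,SoxE") is harmless here only because that fragment never contains "uniprot".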

2

This will do it for your example (with grep and sed):

grep -o '"uniprot":"[^"]*"' your_file | sed 's/.*:"\(.*\)"/\1/'

The way it works is as follows:

• First we print only the matching parts of a grep search to get:

  "uniprot":"P12807"
  "uniprot":"P12807"
  "uniprot":"T12807"
  "uniprot":"P12808"
  "uniprot":"Z12809"
  "uniprot":"P12821"
  "uniprot":"P0C918"

• Then we pipe that to sed and use a capture group to remember the quoted value, replacing each line with only that value to get:

  P12807
  P12807
  T12807
  P12808
  Z12809
  P12821
  P0C918
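Since every match produced by the grep step has the same quote layout, the sed step can also be swapped for cut: the value is the fourth "-delimited field. A sketch on a hypothetical trimmed-down sample, mini.txt:

```shell
# Trimmed-down sample in the question's shape
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > mini.txt

# Each match looks like "uniprot":"P12807"; with " as the delimiter, field 4 is the value
grep -o '"uniprot":"[^"]*"' mini.txt | cut -d'"' -f4
```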
  • This would fail if the input data were changed to the equivalent but pretty-printed JSON.
    – Kusalananda
    Commented Jun 15, 2021 at 11:00
  • @Kusalananda this seems to work for pretty-printed JSON, but I don't have the general solution (pretty-printed/non-pretty-printed) yet: cat file_test.txt | jq | grep -o '"uniprot": "[^"]*"' | sed 's/.*: "\(.*\)"/\1/'
    Commented Jun 16, 2021 at 15:37
1

Perl 5 solution:

$ perl -nle 'print join"\n",m/uniprot\":\"(.*?)\"/g' file.txt
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
1

Using Raku (formerly known as Perl_6)

raku -e 'slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ].join( Q:b[\n] ).put;'

OR

raku -e '.put for slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ];'

Sample Input:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}]

Sample Output:

P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

Briefly:

1. the file is slurped into Raku,
2. comb is used to select out curly-brace wrapped elements,
3. comb is again used to select out double-quoted elements,
4. in the selected elements, double-quotes are removed,
5. an index starting from 0 is generated, taking every sixth element (i.e. the uniprot_ID), and
6. first example: uniprot_IDs are joined by \n and output with put.

I've tried to remove quoting characters as much as possible, to increase portability. Most importantly, the code above still works even if jq, or jq -r, or jq -c is run on the input file prior to feeding it into Raku, e.g.:

cat uniprot_file.txt | jq -c | raku -e '...'

Addendum: It appears that Raku's JSON::Tiny module can parse a cleaned-up version of the OP's input file (with terminal characters corrected), but it requires a little extra work:

1. list-ifying the result of the from-json() call, and
2. split-ting on ", " (comma-space) to recover the hash objects.

~$ raku -MJSON::Tiny -e 'my @a = from-json($_).list given slurp; .<uniprot>.put for @a>>.split(", ");' uniprot_test.txt
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

https://forum.codeselfstudy.com/t/how-to-use-json-in-raku/2519
https://raku.org/

  • another option would be: raku -ne '.say for .comb: /"\"uniprot\":\""<(\w+)>"\""/' file.txt
    Commented Jun 16, 2021 at 7:33
  • With better quotes: raku -ne '.say for .comb: /\"uniprot\"\:\"<(\w+)>\"/' bla.json
    Commented Jun 16, 2021 at 7:40
  • It seems jq would not work with that data; it's not valid JSON. But here's another version that will work even if the input is well formatted: raku -e '.say for slurp.comb: /\"uniprot\"<.ws>\:<.ws>\"<(\w+)>\"/' ble.json Another option that would work for that "broken" data (I still need to work on its speed and make it easier to use) is using JSON::Stream like this: raku -MJSON::Stream -e 'react whenever json-stream(@*ARGS.head.IO.words.Supply, [[q"$", *, "uniprot"],]) { .say }' bla.json
    Commented Jun 17, 2021 at 16:05
  • I like the JSON::Stream option because, even with input that is not valid JSON, you can use JSON-path-like indexes to get at it.
    Commented Jun 17, 2021 at 16:10
  • With the new version of JSON::Stream (github.com/FCO/JSON-Stream), it became easier to set the JSON-path: raku -MJSON::Stream -e 'react whenever json-stream "bla.json".IO.open.Supply, <$.*.uniprot> { .say }'
    Commented Jun 17, 2021 at 19:15
