Using Raku (formerly known as Perl_6)
raku -e 'slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ].join( Q:b[\n] ).put;'
OR
raku -e '.put for slurp.comb( / <[{]>~<[}]> .+? / ).comb( / <["]>~<["]> .+? / ).map( *.subst: q["], :global ).[ 5, { $_ + 6 }...* ];'
Sample Input:
[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}]
Sample Output:
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
Briefly:
- the file is slurped into Raku,
- comb is used to select out curly-brace wrapped elements,
- comb is again used to select out double-quoted elements,
- in the selected elements, double-quotes are removed,
- a zero-based index sequence (5, 11, 17, ...) is generated to take every sixth element (i.e. the uniprot_ID), and
- first example: uniprot_IDs are joined by \n and output with put; second example: each uniprot_ID is put on its own line by the for loop.
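The steps above can be sketched as a small script, one statement per step. This is only an illustration: the sub name uniprot-ids is hypothetical (it is not in the one-liners), the regexes use quoted literals instead of the <[{]>~<[}]> construct, and it assumes flat records with exactly three key/value pairs, no nested braces, and no escaped quotes inside values.

```raku
# Hypothetical helper, not part of the original one-liners.
sub uniprot-ids(Str $text) {
    # 1. Non-greedy match of each {...} record (assumes no nested braces).
    my @objects = $text.comb( / '{' .+? '}' / );

    # 2. Select every double-quoted run. Calling comb on the list in the
    #    one-liner works on its space-joined string, made explicit here.
    #    Each record yields 6 strings: 3 keys and 3 values.
    my @quoted = @objects.join(' ').comb( / '"' .+? '"' / );

    # 3. Remove the double quotes.
    my @bare = @quoted.map( *.subst('"', '', :global) );

    # 4. Zero-based indices 5, 11, 17, ... pick the value that follows
    #    each "uniprot" key; the lazy sequence stops at the array's end.
    return @bare[ 5, { $_ + 6 } ... * ];
}

# Two records taken from the sample input.
my $sample = '[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},'
           ~ '{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}]';

.put for uniprot-ids($sample);   # prints P12807, then P0C918
```

Because the selection is purely positional, a record with a different number of fields would shift the indices and return the wrong values.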
I've tried to remove quoting characters as much as possible, to increase portability. Most importantly, the code above still works even if jq, jq -r, or jq -c is run on the input file prior to feeding it into Raku, e.g.:
cat uniprot_file.txt | jq -c | raku -e '...'
Addendum: It appears that Raku's JSON::Tiny module can parse a cleaned-up version of the OP's input file (with terminal characters corrected), but it requires a little extra work: list-ifying the result of the from-json() call, and splitting on ", " (comma-space) to recover the hash objects.
~$ raku -MJSON::Tiny -e 'my @a = from-json($_).list given slurp;  \
                         .<uniprot>.put for @a>>.split(", ");' uniprot_test.txt
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
https://forum.codeselfstudy.com/t/how-to-use-json-in-raku/2519
https://raku.org/
xmlstarlet or jq since they don't rely on a specific layout of the data.