I am having some fun running AlphaPulldown on a computing cluster. A requirement is to have input sequences in FASTA format. I found that I needed to get ~600 sequences. I had a list of the relevant Uniprot IDs. Surely getting the sequences for these proteins should be straightforward?

Solution

The Uniprot IDs can be converted, using the ID Mapping Tool on the Uniprot website, into any number of other IDs. There’s no option to “convert to sequences” or anything like that. So, the solution is to map the Uniprot IDs to Uniprot IDs! That is, map UniProtKB_AC-ID → UniProtKB by pasting the list of Uniprot IDs into the box (or uploading the file); either way, the result is a collection of the proteins from the list.

Once the collection is on screen, the Download button can be used to get all of the sequences in FASTA format, uncompressed.

We get the FASTA sequences which look like this (two are shown):

 >sp|P35613|BASI_HUMAN Basigin OS=Homo sapiens OX=9606 GN=BSG PE=1 SV=2
 MAAALFVLLGFALLGTHGASGAAGFVQAPLSQQRWVGGSVELHCEAVGSPVPEIQWWFEG
 QGPNDTCSQLWDGARLDRVHIHATYHQHAASTISIDTLVEEDTGTYECRASNDPDRNHLT
 RAPRVKWVRAQAVVLVLEPGTVFTTVEDLGSKILLTCSLNDSATEVTGHRWLKGGVVLKE
 DALPGQKTEFKVDSDDQWGEYSCVFLPEPMGTANIQLHGPPRVKAVKSSEHINEGETAML
 VCKSESVPPVTDWAWYKITDSEDKALMNGSESRFFVSSSQGRSELHIENLNMEADPGQYR
 CNGTSSKGSDQAIITLRVRSHLAALWPFLGIVAEVLVLVTIIFIYEKRRKPEDVLDDDDA
 GSAPLKSSGQHQNDKGKNVRQRNSS
 >sp|P10321|HLAC_HUMAN HLA class I histocompatibility antigen, C alpha chain OS=Homo sapiens OX=9606 GN=HLA-C PE=1 SV=3
 MRVMAPRALLLLLSGGLALTETWACSHSMRYFDTAVSRPGRGEPRFISVGYVDDTQFVRF
 DSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQADRVSLRNLRGYYNQSEDGSHTLQ
 RMSGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKLEAARAAEQL
 RAYLEGTCVEWLRRYLENGKETLQRAEPPKTHVTHHPLSDHEATLRCWALGFYPAEITLT
 WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHMQHEGLQEPLTLSWEP
 SSQPTIPIMGIVAGLAVLVVLAVLGAVVTAMMCRRKSSGGKGGSCSQAACSNSAQGSDES
 LITCKA

If this is what you need, great! However, I needed a further bit of wrangling for AlphaPulldown.

The FASTA header (the bit after the >) needs to be

 >P35613

rather than

 >sp|P35613|BASI_HUMAN Basigin OS=Homo sapiens OX=9606 GN=BSG PE=1 SV=2

So we need a further bit of work on the command line. This awk command rewrites each header that starts with >sp| so that it contains only the Uniprot ID; every other line is printed unchanged.

 awk '/^>sp\|.*\|/{gsub(/^>sp\|/,""); gsub(/\|.*/,""); print ">" $0; next} {print}' input.fasta > output.fasta
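The command above only matches headers that begin with >sp| (reviewed Swiss-Prot entries). If a list also contains unreviewed TrEMBL entries, their headers begin with >tr| and would pass through untouched. Below is a minimal sketch of a variant that handles both by splitting the header on the pipe character and keeping the second field. The function name simplify_headers is made up for illustration.

```shell
# Hypothetical helper (the name is an assumption, not part of any tool).
# Reads FASTA on stdin and writes FASTA on stdout, reducing headers such as
#   >sp|P35613|BASI_HUMAN Basigin OS=Homo sapiens ...
#   >tr|A0A024R161|A0A024R161_HUMAN ...
# to just ">P35613" / ">A0A024R161". Sequence lines are printed unchanged.
simplify_headers() {
  awk -F'|' '
    /^>(sp|tr)[|]/ { print ">" $2; next }   # header line: keep only the ID
    { print }                               # everything else: pass through
  '
}
```

It would be used as `simplify_headers < input.fasta > output.fasta`, where both file names are placeholders.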

Some further explanation

Why did I write this post?

If I struggle to find a quick answer for something that should be painless – especially if it caused me to expend some energy to solve it – I should really write it up, to help other people or the “future me”.

What did I try?

The reason this part is not at the top of the post is that this is not a recipe website where I talk about my life journey before giving you what you want – the recipe. Scroll up for the solution.

I had a script in R which queries the Uniprot API (this is how I initially got my list of Uniprot IDs). So, I could have edited this script to also get the sequences and then use them to make a file in FASTA format. However, this script does lots of things and I thought “there must be an easier way”.
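For reference, the API route can also be sketched on the command line rather than in R. This is not the R script mentioned above, just an illustration under some assumptions: that each entry is served as FASTA at rest.uniprot.org/uniprotkb/<ID>.fasta, that curl is installed, and that the IDs sit one per line in a text file. Both function names are made up.

```shell
# Build the URL for one accession. The URL pattern is an assumption based on
# the public UniProt REST API; check the current API docs before relying on it.
uniprot_fasta_url() {
  printf 'https://rest.uniprot.org/uniprotkb/%s.fasta\n' "$1"
}

# Hypothetical helper: fetch every ID listed (one per line) in the file $1
# and write the FASTA records to stdout. Failed IDs are reported on stderr.
fetch_fasta_for_ids() {
  while IFS= read -r id; do
    [ -n "$id" ] || continue                      # skip blank lines
    curl -fsS "$(uniprot_fasta_url "$id")" || echo "failed: $id" >&2
    sleep 1                                       # pause between requests
  done < "$1"
}
```

Usage would be along the lines of `fetch_fasta_for_ids ids.txt > input.fasta` (placeholder file names). One request per ID is slow for ~600 proteins, which is part of why the ID Mapping route is the easier option here.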

I found this StackOverflow question asking the exact same thing. Annoyingly, the example code resulted in retrieval of about 30 of my 600 IDs. So close! I don’t know why it failed.
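Whichever route is used, a shortfall like this is easy to miss in a file of hundreds of sequences, so it is worth comparing the IDs in the FASTA file against the original list. Here is a minimal sketch, assuming the ID list has one ID per line and the FASTA headers are either the full Uniprot form (>sp|ID|NAME ...) or the simplified form (>ID). The function name missing_ids is made up for illustration.

```shell
# Hypothetical helper: print every ID from the list file ($1) that has no
# matching header in the FASTA file ($2). Prints nothing if all are present.
missing_ids() {
  awk -F'|' '
    FILENAME == ARGV[1] {                 # first operand: the FASTA file
      if (/^>/) {
        # ">sp|ID|NAME ..." has the ID in field 2; ">ID" has it in field 1
        id = (NF >= 2) ? $2 : substr($1, 2)
        sub(/[ \t].*/, "", id)            # drop any description after the ID
        seen[id] = 1
      }
      next
    }
    {                                     # second operand: the ID list
      id = $0
      gsub(/^[ \t]+|[ \t\r]+$/, "", id)   # trim whitespace and stray CRs
      if (id != "" && !(id in seen)) print id
    }
  ' "$2" "$1"
}
```

For example, `missing_ids ids.txt output.fasta` (placeholder file names) lists the IDs that still need chasing up.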

The question also says “I know I can do it from UniProt website”. Well, I had already looked at that and couldn’t figure out how to do it. The help page gives an example where a predefined set of proteins can be downloaded, but it was not obvious how to do this for a list of proteins, since the search bar only works for a single query (AFAICT). However, when the R script didn’t work, I figured out the solution above.

—

This post is part of a series of tips.