How to Get Accession Numbers from FASTA File from GenBank
Say you have retrieved multiple FASTA sequences from UniProt or NCBI using UniProt IDs, and now you want to extract the accession numbers from those FASTA files. This is easy to do by hand for a few sequences, but it becomes a tedious job when you are handling a large number of them. You can also use the FA-BOX tools for this. Here I am going to share a PERL script that extracts the accession numbers from the FASTA header lines.
1. PERL Script for GenBank FASTA File
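#!/usr/bin/perl
# get the accession numbers from FASTA file
use strict;
use warnings;

while (<>) {
    chomp;
    next unless (/^>/);                # only header lines start with '>'
    s/^>\s*//;                         # drop the leading '>'
    my @line = split /\s+/;            # split the header on whitespace
    my $first = shift(@line);          # first token holds the identifiers
    my @numbers = split /\|/, $first;  # gi|number|gb|accession.version|locus
    my $accNum = $numbers[3];          # fourth field is the accession
    next unless defined $accNum;       # skip headers without that field
    $accNum =~ s/\.\d+$//;             # remove version numbers
    print "$accNum\t\t# " . $_ . "\n";
}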
Save the script to a file, for example genebank-access.pl (the name used in the command below).
Usage
perl genebank-access.pl input.fasta >result.fasta
Input file
>gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727863 3', mRNA sequence
TTTTTTTTTTTTTTTTTGGATAATCTAATACATCTTTATTAATCTCAAAATATGTCTTTTTGGTCAATAA
CATTCTTTTAATAACCATTAAAGATAATCTATTATTAAAAGGTTAATTTTGCTTCCCTTAGATAAATATA
ACAGAATTCAAATACGAACTTGCCCCGTTCTGTTGAAGTGTAGTGAGAGCTGACTATGTGAACTGAATAA
TGCTTGTCCTGTACAGAATGGTGAAGAAAAGCAAACTTTTGTTCACCTGGGAACACTTTTTAAAAATACA
CTGATACTTAACTTAAATAATTGAATACATATCATAAACACACAAATTACACCATTTAAAATATACTTAC
TACAAAAAGATCCTGAACATTATTGATTAATGCAAATAAAACTACTTCGCATTATTATAAAAGTCAAAAT
ATTTTACATCACCCTCGTGCCG
>gi|6701969|gb|AW295333.1|AW295333 UI-H-BI2-ahv-b-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727895 3', mRNA sequence
TTTTTTTTTATTTTTTTAGTTTTTAATTATGCATATTTCTTTAAATTCCCAGAAGTAGGATTTCTGGGTC
AAGGATATGAACATAATTTAATGCTTGCCAAATTGCCTTTCAAAAAGGTTGTGTCAATTTATACTTTTCC
TTCGGCAGTGCAGGATGAATACTGGTTTCACCACAGCCTTACCAACATTGGCTATTTCCAGTTTTCTTCC
TAAATTAATAGGTGAAAAATGGGTCTTGTTATCTAACTTGCATTTCTTTGATTACCAGTGAGGTTGAATG
TCTTTATAAGCTTCTTTCCTAACAGGTTTTTTTTCCTTATTCCCATTGTCTATTTATATGCTTTGTCCAT
TTGTTTGTTGGTGGGAGGAGATTGCAGTCTTTTTCTTACCAATTTATATGATAAAGAAGAAGGGAGTTCA
GGCTAGTTGAAGCTCTGGCCTGTTGTTTTATTCACAAGCTCAATCTGGAGCTTCAG
>gi|6702094|gb|AW295458.1|AW295458 UI-H-BI2-ahw-e-05-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728400 3', mRNA sequence
TTTTTTTTTTTTTTTTTAAGCAAAGTTTAAGTGTTCTTTATGTGTGCTTTTTGTCCCATACCTAAGAAAC
CATTAAGGCACAGGATTTGAATGAATATCTGGCTCCAGCATTCATGTTCTTAGCCATTACACTACACTGC
CTGTCTAGAAACTAAAGGCTGTTAGTAATATTGCACATTAAGTTATTATAAGGGTTTCATGTTGTGAACT
AAAAAACACCCATGGGGCAGACACCCTATATTCATCCTCCAACATCAGCATGGACTTCTGAAGGATTCTG
GGCTGGAACTGTGATGCCATGTATTAGGGAGAAGGAATGAGGCTCGGGCAAGAGCAGGACTGGTCCCCAT
GCTGCACACCACACACCACTGTGATCTCCATGTGCTTCCTCTCGGCCGTCACAGTCTCTCCCAAACAGAG
ATTCACACACAGACACCCCTAATGCTGGGAGGGGTTCACTT
>gi|6702105|gb|AW295469.1|AW295469 UI-H-BI2-ahw-f-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728446 3', mRNA sequence
TTTTTTTTTTTTTTTTTACATATGCTATACATACATTTATTTACAGTTAGGTATACGCTTCTTAGAAGCC
TTTAAATAGCAAATTAGGCAATGCACAAGGAATAGTCCATTTAATTTTAAGTAATATATATAAAGTTCTC
ATTTTAAAATTATATTAGTAGGGATGTAGCCATCTCCAGGAAGGCTAAGAAAGTTCTTGGAAATTCTTGC
ATCTCCAACATTCCAACTTCTTAACGTTTCCATTTTTAGGCATTAACCGCGTCCTTCAGTTCCTCTTTCT
CTCGCTTTCTCACATGCACACTCTCTGCCACACTCGTCACTTTAGCTTCCTCTTCCCTGAGTTTGATGTC
ACACTGGGCAAAACTCAGATTTTAGTTTCTGGGCCCTCAGCAATGAGGGGCTGGAAAGAGTTTCTAATCC
TGGCAACTTCTGCTGCACACAGATGTCCAAAGCCTTTGCTGTACCATTCTGGGGAGTGACCTTTCAGGTC
TATCCTGCAGATGTGAGTCAGTCTTGACTTGCCAGAGCCACACGG
>gi|6702160|gb|AW295524.1|AW295524 UI-H-BI2-ahx-c-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728303 3', mRNA sequence
CGGCCGCCAACTTTTTTTACTTTTTTAAAGTGGGCGTTAGGAGCATGCACAAAGACCATGCATTGTCACA
TCATCAGGCGAACAAATGGAAAGGTAATAAATAATTGCTCCTAACGCCCACTTTAAAATAGTAAAACCAA
ATGGAACAGTGAAGTGAGCCGGTCTTCACTGCTCTGAGCAGGAAGCGTGCTAGCGGCCAGATGGGGCCAC
AGTCAGATGACATGGCAACAGCTGGCTGGGCTGGATGCGCACAGCAGCCCCTCCTTAGGGTTCCGCTGGA
TGTTCACAGAGATGGTCTTTTCGCACAGAGGTAGCTGCAGCACGTTATTTTTCAGGTTCTTGCTCTTTAT
CCGCCTCGTGCCG
Result file
AW295329 # gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727863 3', mRNA sequence
AW295333 # gi|6701969|gb|AW295333.1|AW295333 UI-H-BI2-ahv-b-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727895 3', mRNA sequence
AW295458 # gi|6702094|gb|AW295458.1|AW295458 UI-H-BI2-ahw-e-05-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728400 3', mRNA sequence
AW295469 # gi|6702105|gb|AW295469.1|AW295469 UI-H-BI2-ahw-f-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728446 3', mRNA sequence
AW295524 # gi|6702160|gb|AW295524.1|AW295524 UI-H-BI2-ahx-c-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728303 3', mRNA sequence
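The GenBank script in this section relies on the older NCBI header layout, where the accession is the fourth pipe-delimited field (gi|number|gb|accession.version|locus). Newer NCBI downloads typically start the header directly with accession.version instead. The sketch below shows one way to handle both layouts. The helper name genbank_accession and the fallback rule are assumptions made for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the accession
# without its version suffix, or nothing if no identifier is present.
sub genbank_accession {
    my ($header) = @_;
    $header =~ s/^>\s*//;                  # drop the leading '>'
    my ($id) = split /\s+/, $header;       # first whitespace-delimited token
    return unless defined $id && length $id;

    my @fields = split /\|/, $id;
    # Old layout: gi|6701965|gb|AW295329.1|AW295329 -> use field 3
    # Assumed new layout: AW295329.1                -> use field 0
    my $acc = @fields >= 4 ? $fields[3] : $fields[0];
    return unless defined $acc && length $acc;

    $acc =~ s/\.\d+$//;                    # strip the version, e.g. ".1"
    return $acc;
}

# Small demonstration on two header styles for the same record.
for my $header (
    '>gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1',
    '>AW295329.1 UI-H-BI2-ahv-a-12-0-UI.s1',
) {
    my $acc = genbank_accession($header);
    print defined $acc ? "$acc\n" : "no accession found\n";
}
```

Both sample headers print AW295329. Keeping the parsing in a small subroutine makes it easy to call from the same while (<>) loop used in the script.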
2. PERL Script for Uniprot FASTA File
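#!/usr/bin/perl
# get the accession numbers from FASTA file
use strict;
use warnings;

while (<>) {
    chomp;
    next unless (/^>/);                # only header lines start with '>'
    s/^>\s*//;                         # drop the leading '>'
    my @line = split /\s+/;            # split the header on whitespace
    my $first = shift(@line);          # first token holds the identifiers
    my @numbers = split /\|/, $first;  # db|accession|entry name
    my $accNum = $numbers[1];          # second field is the accession
    next unless defined $accNum;       # skip headers without that field
    $accNum =~ s/\.\d+$//;             # remove version numbers
    print "$accNum\t\t# " . $_ . "\n";
}
Usage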
perl uniprot-access.pl input.fasta >result.fasta
Input file
>tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis GN=type I DNA polymerase alpha PE=3 SV=1
MSKQTTSSGRVIKKVSDKKQDALAQFKAAREGKEKRTEQYKGDEHKKIFEEIDAEEYDAL
YDQRENDDFIVDDDGIGYKEKGGEIWDYEESDDDYSNKEKKVKSKKKKQEGGDIAAFMFP
TQGLNKRKGGNTGVQGIKKQGKVNEAQSKDLLNELMDDFDNKPIDELEDIHTAHQALNVD
DSNFALSKEQQMMNKYNVIIQPQATTNIVVQEQKQIEVKKRSLEEMRQSNSTIKNNGHNE
SRVSDQNKTQQAMNNTSYQTALDHSVVQQIDQVEQEKMAIDEEWQLIKEQNEQMKVMTAA
SSESVDSYPLPVNKDKELAFFWFDAHEENMGLDVFLFGKVYQPELKQYVSCSLKVNGMQR
IVYALPKVSKNKSRAELTKEEEQEMALKIFSELDGIRKNKFPSISQWKCKVAKRKYAFEM
>tr|Q8I945|Q8I945_9SPIT DNA polymerase OS=Urostyla grandis GN=type II DNA polymerase alpha PE=3 SV=1
MSKQTTSSGRVIKKVSDKKQDALAQFKAAREGKEKRTEQYKGDEHKKIFEEIDAEEYDAL
YDQRENDDFIVDDDGIGYKEKGGEIWDYEESDDDYSNKEKKVKSKKKKQEGGDIAAFMFP
TQGLNKRKGGNTGVQGIKKQGKVNEAQSKDLLNELMDDFDNKPIDELEDIHTAHQALNVD
DSNFALSKEQQMMNKYNVIIPPQATTNIVVQEQKQIEVKKRSLEEMRQSNSTIKNNGHNE
SRVSDQNKTQQAMNNTSYQTALDHSVVQQIDQVEQEKMAIDEEWQLIKEQNEQMKVMTAA
SSESVDSYPLPVNKDKELAFFWFDAHEENMGLDVFLFGKVYQPELKQYVSCSLKVNGMQR
IVYALPKVSKNKSRAELTKEEEQEMALKIFSELDGIRKNKFPSISQWKCKVAKRKYAFEM
>tr|Q6QXG1|Q6QXG1_GVAS DNA polymerase OS=Agrotis segetum granulosis virus GN=ORF101,DNA polymerase PE=3 SV=1
MDDDYLYCDYDEIDIPPIIRSSPKRKLYDEHETTPVLKKERDNSSVEKDGECSSKYKKEP
VCETSEDFEVCSNLLEKVVKSDRETAHYSANCVFKITKLHYSSSFLYIFLTGNDNVQYYF
KTYCPIYSYKLCTHRFQSCRFNCQSYKSLVVTGLKSRECHRVNVIKMERSKCSGEKYLLD
EMCNDVNRVQMQTGIYEGDYVRFKDGITVDENGCATGAVSELVKVTMEELTQPIDPIVGS
YDLETFTDGMRFSNSEVDPIITISYVLRKQNNNMSRYCFINTNGKRFRLNDVYLANAEYC
>tr|M5B5N0|M5B5N0_PLEWA DNA polymerase mu OS=Pleurodeles waltl GN=polymerase mu PE=2 SV=1
MTLPLRKRRRPPPAVADSSQGAVRFPEVGIFLVEKRMGSSRRAFLSKLARSKGFRVEAVY
SDTVTHVVSEQNTRDEVCEWLQAQPGPGRLDTPALLDVSWFTESMASGSPVLIEPRHCLV
SSQCPESDASEVEGPTVPVYACQRRTALPNWNQILTDALEILAEEAEFGNSEGRSLAFAR
AAPVLRSIPYAVTRFEDLNSLPCFGAHSRKIVQEITEDGSSVEVQRVLHSERYRTLKVFS
GIFGVGKKTADRWYQEGLRTLDDLRKKEKKLNRQQEAGLQHYTDLNSPVTRLEADKIQHV
VQDAVLRFLPGAIITLTGGFQRGKQSGHDVDFLITHPTEGKEMGLLIKVVSWLSSQGLLL
YHHMKQNSYKEPTQMSVQASKDRLDHFESCFSIFKLDTPNEQMESSSTAAENIRNWKALR
Result file
Q8I944 # tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis GN=type I DNA polymerase alpha PE=3 SV=1
Q8I945 # tr|Q8I945|Q8I945_9SPIT DNA polymerase OS=Urostyla grandis GN=type II DNA polymerase alpha PE=3 SV=1
Q6QXG1 # tr|Q6QXG1|Q6QXG1_GVAS DNA polymerase OS=Agrotis segetum granulosis virus GN=ORF101,DNA polymerase PE=3 SV=1
M5B5N0 # tr|M5B5N0|M5B5N0_PLEWA DNA polymerase mu OS=Pleurodeles waltl GN=polymerase mu PE=2 SV=1
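The UniProt script takes index 1 because a UniProt header starts with three pipe-delimited fields: the database (sp for Swiss-Prot, tr for TrEMBL), the accession, and the entry name. As a rough sketch, the same fields can be pulled out with a single regular expression. The helper name uniprot_fields is hypothetical and only serves to illustrate the header layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split a UniProt header
# into (database, accession, entry_name). Returns an empty list when the
# header does not look like a UniProt sp|...|... or tr|...|... header.
sub uniprot_fields {
    my ($header) = @_;
    # e.g. >tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis ...
    if ($header =~ /^>\s*(sp|tr)\|([^|\s]+)\|(\S+)/) {
        return ($1, $2, $3);
    }
    return;
}

# Demonstration on a header taken from the input file above.
my ($db, $acc, $name) =
    uniprot_fields('>tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis');
print "$db\t$acc\t$name\n" if defined $acc;
```

For the sample header this prints tr, Q8I944 and Q8I944_9SPIT separated by tabs. Matching on the sp or tr prefix also means a GenBank-style header is rejected rather than silently returning the wrong field.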