How to Get Accession Numbers from GenBank and UniProt FASTA Files

Say you have retrieved multiple FASTA sequences from UniProt or NCBI and now want to extract the accession numbers from the FASTA files. This is easy if you have only a few sequences, but it becomes a tedious job when the number of sequences is large. You can also use the FA-BOX tools to do the same. Here I am going to share Perl scripts that read the header line of every sequence, extract the accession number, and print it next to the original header.


1. PERL Script for GenBank FASTA File

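#!/usr/bin/perl
use strict;
use warnings;

# get the accession numbers from a GenBank FASTA file

while (<>) {
    chomp;
    next unless /^>/;    # only header lines carry the identifiers
    s/^>\s*//;           # drop the leading '>'

    # the first whitespace-delimited token holds the identifiers,
    # e.g. gi|6701965|gb|AW295329.1|AW295329
    my ($first) = split /\s+/;
    next unless defined $first;    # skip empty header lines
    my @numbers = split /\|/, $first;

    my $accNum = $numbers[3];      # fourth field: accession.version
    unless (defined $accNum) {
        warn "no accession field in header: $_\n";
        next;
    }
    $accNum =~ s/\.\d+$//;         # remove the version number

    print "$accNum\t\t# $_\n";
}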

Save the script in a plain-text file named genebank-access.pl.

Usage

perl genebank-access.pl input.fasta >result.fasta

Input file

>gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727863 3', mRNA sequence
TTTTTTTTTTTTTTTTTGGATAATCTAATACATCTTTATTAATCTCAAAATATGTCTTTTTGGTCAATAA
CATTCTTTTAATAACCATTAAAGATAATCTATTATTAAAAGGTTAATTTTGCTTCCCTTAGATAAATATA
ACAGAATTCAAATACGAACTTGCCCCGTTCTGTTGAAGTGTAGTGAGAGCTGACTATGTGAACTGAATAA
TGCTTGTCCTGTACAGAATGGTGAAGAAAAGCAAACTTTTGTTCACCTGGGAACACTTTTTAAAAATACA
CTGATACTTAACTTAAATAATTGAATACATATCATAAACACACAAATTACACCATTTAAAATATACTTAC
TACAAAAAGATCCTGAACATTATTGATTAATGCAAATAAAACTACTTCGCATTATTATAAAAGTCAAAAT
ATTTTACATCACCCTCGTGCCG

>gi|6701969|gb|AW295333.1|AW295333 UI-H-BI2-ahv-b-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727895 3', mRNA sequence
TTTTTTTTTATTTTTTTAGTTTTTAATTATGCATATTTCTTTAAATTCCCAGAAGTAGGATTTCTGGGTC
AAGGATATGAACATAATTTAATGCTTGCCAAATTGCCTTTCAAAAAGGTTGTGTCAATTTATACTTTTCC
TTCGGCAGTGCAGGATGAATACTGGTTTCACCACAGCCTTACCAACATTGGCTATTTCCAGTTTTCTTCC
TAAATTAATAGGTGAAAAATGGGTCTTGTTATCTAACTTGCATTTCTTTGATTACCAGTGAGGTTGAATG
TCTTTATAAGCTTCTTTCCTAACAGGTTTTTTTTCCTTATTCCCATTGTCTATTTATATGCTTTGTCCAT
TTGTTTGTTGGTGGGAGGAGATTGCAGTCTTTTTCTTACCAATTTATATGATAAAGAAGAAGGGAGTTCA
GGCTAGTTGAAGCTCTGGCCTGTTGTTTTATTCACAAGCTCAATCTGGAGCTTCAG

>gi|6702094|gb|AW295458.1|AW295458 UI-H-BI2-ahw-e-05-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728400 3', mRNA sequence
TTTTTTTTTTTTTTTTTAAGCAAAGTTTAAGTGTTCTTTATGTGTGCTTTTTGTCCCATACCTAAGAAAC
CATTAAGGCACAGGATTTGAATGAATATCTGGCTCCAGCATTCATGTTCTTAGCCATTACACTACACTGC
CTGTCTAGAAACTAAAGGCTGTTAGTAATATTGCACATTAAGTTATTATAAGGGTTTCATGTTGTGAACT
AAAAAACACCCATGGGGCAGACACCCTATATTCATCCTCCAACATCAGCATGGACTTCTGAAGGATTCTG
GGCTGGAACTGTGATGCCATGTATTAGGGAGAAGGAATGAGGCTCGGGCAAGAGCAGGACTGGTCCCCAT
GCTGCACACCACACACCACTGTGATCTCCATGTGCTTCCTCTCGGCCGTCACAGTCTCTCCCAAACAGAG
ATTCACACACAGACACCCCTAATGCTGGGAGGGGTTCACTT

>gi|6702105|gb|AW295469.1|AW295469 UI-H-BI2-ahw-f-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728446 3', mRNA sequence
TTTTTTTTTTTTTTTTTACATATGCTATACATACATTTATTTACAGTTAGGTATACGCTTCTTAGAAGCC
TTTAAATAGCAAATTAGGCAATGCACAAGGAATAGTCCATTTAATTTTAAGTAATATATATAAAGTTCTC
ATTTTAAAATTATATTAGTAGGGATGTAGCCATCTCCAGGAAGGCTAAGAAAGTTCTTGGAAATTCTTGC
ATCTCCAACATTCCAACTTCTTAACGTTTCCATTTTTAGGCATTAACCGCGTCCTTCAGTTCCTCTTTCT
CTCGCTTTCTCACATGCACACTCTCTGCCACACTCGTCACTTTAGCTTCCTCTTCCCTGAGTTTGATGTC
ACACTGGGCAAAACTCAGATTTTAGTTTCTGGGCCCTCAGCAATGAGGGGCTGGAAAGAGTTTCTAATCC
TGGCAACTTCTGCTGCACACAGATGTCCAAAGCCTTTGCTGTACCATTCTGGGGAGTGACCTTTCAGGTC
TATCCTGCAGATGTGAGTCAGTCTTGACTTGCCAGAGCCACACGG

>gi|6702160|gb|AW295524.1|AW295524 UI-H-BI2-ahx-c-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728303 3', mRNA sequence
CGGCCGCCAACTTTTTTTACTTTTTTAAAGTGGGCGTTAGGAGCATGCACAAAGACCATGCATTGTCACA
TCATCAGGCGAACAAATGGAAAGGTAATAAATAATTGCTCCTAACGCCCACTTTAAAATAGTAAAACCAA
ATGGAACAGTGAAGTGAGCCGGTCTTCACTGCTCTGAGCAGGAAGCGTGCTAGCGGCCAGATGGGGCCAC
AGTCAGATGACATGGCAACAGCTGGCTGGGCTGGATGCGCACAGCAGCCCCTCCTTAGGGTTCCGCTGGA
TGTTCACAGAGATGGTCTTTTCGCACAGAGGTAGCTGCAGCACGTTATTTTTCAGGTTCTTGCTCTTTAT
CCGCCTCGTGCCG

Result file

AW295329  # gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727863 3', mRNA sequence
AW295333  # gi|6701969|gb|AW295333.1|AW295333 UI-H-BI2-ahv-b-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2727895 3', mRNA sequence
AW295458  # gi|6702094|gb|AW295458.1|AW295458 UI-H-BI2-ahw-e-05-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728400 3', mRNA sequence
AW295469  # gi|6702105|gb|AW295469.1|AW295469 UI-H-BI2-ahw-f-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728446 3', mRNA sequence
AW295524  # gi|6702160|gb|AW295524.1|AW295524 UI-H-BI2-ahx-c-04-0-UI.s1 NCI_CGAP_Sub4 Homo sapiens cDNA clone IMAGE:2728303 3', mRNA sequence
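NCBI no longer puts GI numbers in the FASTA headers it serves, so a newer download may start directly with the accession (for example ">AW295329.1 ..."), in which case there is no fourth field to read. The following is a minimal sketch, not part of the original script, of a helper that accepts both layouts. The name genbank_accession is hypothetical, and the pipe-free layout is an assumption about what your input looks like; any other layout is reported as not found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the
# accession number from a GenBank-style FASTA header, or undef.
sub genbank_accession {
    my ($header) = @_;
    $header =~ s/^>\s*//;                 # drop the leading '>'
    my ($id) = split /\s+/, $header;      # first token holds the identifiers
    return unless defined $id;

    my @fields = split /\|/, $id;
    return unless @fields;

    my $acc;
    if ($fields[0] eq 'gi' && @fields >= 4) {
        $acc = $fields[3];                # older layout: gi|number|db|accession.version|locus
    } elsif (@fields == 1) {
        $acc = $fields[0];                # assumed newer layout: accession.version only
    }
    return unless defined $acc;

    $acc =~ s/\.\d+$//;                   # remove the version number
    return $acc;
}

# Two headers for the same record, one in each layout.
for my $header (
    ">gi|6701965|gb|AW295329.1|AW295329 UI-H-BI2-ahv-a-12-0-UI.s1",
    ">AW295329.1 UI-H-BI2-ahv-a-12-0-UI.s1",
) {
    my $acc = genbank_accession($header);
    print defined $acc ? "$acc\n" : "no accession found\n";
}
```

Both example headers print AW295329. To use the helper on a file, call it on each header line inside the while (<>) loop in place of the field lookup.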

2. PERL Script for Uniprot FASTA File

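#!/usr/bin/perl
use strict;
use warnings;

# get the accession numbers from a UniProt FASTA file

while (<>) {
    chomp;
    next unless /^>/;    # only header lines carry the identifiers
    s/^>\s*//;           # drop the leading '>'

    # the first whitespace-delimited token holds the identifiers,
    # e.g. tr|Q8I944|Q8I944_9SPIT
    my ($first) = split /\s+/;
    next unless defined $first;    # skip empty header lines
    my @numbers = split /\|/, $first;

    my $accNum = $numbers[1];      # second field: accession
    unless (defined $accNum) {
        warn "no accession field in header: $_\n";
        next;
    }
    $accNum =~ s/\.\d+$//;         # remove version numbers, if any

    print "$accNum\t\t# $_\n";
}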

Usage

perl uniprot-access.pl input.fasta >result.fasta

Input file

>tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis GN=type I DNA polymerase alpha PE=3 SV=1
MSKQTTSSGRVIKKVSDKKQDALAQFKAAREGKEKRTEQYKGDEHKKIFEEIDAEEYDAL
YDQRENDDFIVDDDGIGYKEKGGEIWDYEESDDDYSNKEKKVKSKKKKQEGGDIAAFMFP
TQGLNKRKGGNTGVQGIKKQGKVNEAQSKDLLNELMDDFDNKPIDELEDIHTAHQALNVD
DSNFALSKEQQMMNKYNVIIQPQATTNIVVQEQKQIEVKKRSLEEMRQSNSTIKNNGHNE
SRVSDQNKTQQAMNNTSYQTALDHSVVQQIDQVEQEKMAIDEEWQLIKEQNEQMKVMTAA
SSESVDSYPLPVNKDKELAFFWFDAHEENMGLDVFLFGKVYQPELKQYVSCSLKVNGMQR
IVYALPKVSKNKSRAELTKEEEQEMALKIFSELDGIRKNKFPSISQWKCKVAKRKYAFEM
>tr|Q8I945|Q8I945_9SPIT DNA polymerase OS=Urostyla grandis GN=type II DNA polymerase alpha PE=3 SV=1
MSKQTTSSGRVIKKVSDKKQDALAQFKAAREGKEKRTEQYKGDEHKKIFEEIDAEEYDAL
YDQRENDDFIVDDDGIGYKEKGGEIWDYEESDDDYSNKEKKVKSKKKKQEGGDIAAFMFP
TQGLNKRKGGNTGVQGIKKQGKVNEAQSKDLLNELMDDFDNKPIDELEDIHTAHQALNVD
DSNFALSKEQQMMNKYNVIIPPQATTNIVVQEQKQIEVKKRSLEEMRQSNSTIKNNGHNE
SRVSDQNKTQQAMNNTSYQTALDHSVVQQIDQVEQEKMAIDEEWQLIKEQNEQMKVMTAA
SSESVDSYPLPVNKDKELAFFWFDAHEENMGLDVFLFGKVYQPELKQYVSCSLKVNGMQR
IVYALPKVSKNKSRAELTKEEEQEMALKIFSELDGIRKNKFPSISQWKCKVAKRKYAFEM
>tr|Q6QXG1|Q6QXG1_GVAS DNA polymerase OS=Agrotis segetum granulosis virus GN=ORF101,DNA polymerase PE=3 SV=1
MDDDYLYCDYDEIDIPPIIRSSPKRKLYDEHETTPVLKKERDNSSVEKDGECSSKYKKEP
VCETSEDFEVCSNLLEKVVKSDRETAHYSANCVFKITKLHYSSSFLYIFLTGNDNVQYYF
KTYCPIYSYKLCTHRFQSCRFNCQSYKSLVVTGLKSRECHRVNVIKMERSKCSGEKYLLD
EMCNDVNRVQMQTGIYEGDYVRFKDGITVDENGCATGAVSELVKVTMEELTQPIDPIVGS
YDLETFTDGMRFSNSEVDPIITISYVLRKQNNNMSRYCFINTNGKRFRLNDVYLANAEYC
>tr|M5B5N0|M5B5N0_PLEWA DNA polymerase mu OS=Pleurodeles waltl GN=polymerase mu PE=2 SV=1
MTLPLRKRRRPPPAVADSSQGAVRFPEVGIFLVEKRMGSSRRAFLSKLARSKGFRVEAVY
SDTVTHVVSEQNTRDEVCEWLQAQPGPGRLDTPALLDVSWFTESMASGSPVLIEPRHCLV
SSQCPESDASEVEGPTVPVYACQRRTALPNWNQILTDALEILAEEAEFGNSEGRSLAFAR
AAPVLRSIPYAVTRFEDLNSLPCFGAHSRKIVQEITEDGSSVEVQRVLHSERYRTLKVFS
GIFGVGKKTADRWYQEGLRTLDDLRKKEKKLNRQQEAGLQHYTDLNSPVTRLEADKIQHV
VQDAVLRFLPGAIITLTGGFQRGKQSGHDVDFLITHPTEGKEMGLLIKVVSWLSSQGLLL
YHHMKQNSYKEPTQMSVQASKDRLDHFESCFSIFKLDTPNEQMESSSTAAENIRNWKALR



Result file

Q8I944  # tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis GN=type I DNA polymerase alpha PE=3 SV=1
Q8I945  # tr|Q8I945|Q8I945_9SPIT DNA polymerase OS=Urostyla grandis GN=type II DNA polymerase alpha PE=3 SV=1
Q6QXG1  # tr|Q6QXG1|Q6QXG1_GVAS DNA polymerase OS=Agrotis segetum granulosis virus GN=ORF101,DNA polymerase PE=3 SV=1
M5B5N0  # tr|M5B5N0|M5B5N0_PLEWA DNA polymerase mu OS=Pleurodeles waltl GN=polymerase mu PE=2 SV=1
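The two scripts differ only in which pipe-separated field they read: field 3 for the GenBank headers and field 1 for the UniProt ones. As a rough sketch (my own combination, not something from either script), the database tag at the start of the header can choose the field, so one script can handle a mixed file. The table %ACC_FIELD and the function accession_from_header are hypothetical names, and the "sp" entry assumes that Swiss-Prot headers follow the same layout as the "tr" headers shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup table: database tag at the start of the header
# => index of the pipe-separated field that holds the accession.
my %ACC_FIELD = (
    gi => 3,    # gi|number|gb|accession.version|locus  (GenBank)
    tr => 1,    # tr|accession|entry name               (UniProt TrEMBL)
    sp => 1,    # sp|accession|entry name               (UniProt Swiss-Prot, assumed)
);

# Return the accession for a known header layout, or undef otherwise.
sub accession_from_header {
    my ($header) = @_;
    $header =~ s/^>\s*//;                 # drop the leading '>'
    my ($id) = split /\s+/, $header;      # first token holds the identifiers
    return unless defined $id;

    my @fields = split /\|/, $id;
    return unless @fields;

    my $index = $ACC_FIELD{ $fields[0] };
    return unless defined $index && defined $fields[$index];

    (my $acc = $fields[$index]) =~ s/\.\d+$//;    # remove the version number
    return $acc;
}

# One header from each of the input files above (descriptions shortened).
for my $header (
    ">gi|6702160|gb|AW295524.1|AW295524 UI-H-BI2-ahx-c-04-0-UI.s1",
    ">tr|Q8I944|Q8I944_9SPIT DNA polymerase OS=Urostyla grandis",
) {
    my $acc = accession_from_header($header);
    print defined $acc ? "$acc\n" : "unrecognised header: $header\n";
}
```

This prints AW295524 and then Q8I944. A header whose tag is not in the table is reported instead of being guessed at, which makes unexpected formats easy to spot.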
