How to retrieve NCBI GenBank records with a range of accession numbers

I have previously discussed how to download sequences from the NCBI database when you have a list of accession numbers. But what if you have a range of accession numbers (e.g. EF100000 to EF102000)? Searching NCBI GenBank with a single accession number pulls up a record without any problem, but you would obviously not want to do this for thousands of EST records. So what is the easiest way to retrieve all these records from GenBank at once when you have a range of accession numbers? There are two ways to do it: you can use GenBank’s web interface or, if you are comfortable with it, the command line. For the command-line option you can, as usual, use a Perl script to query GenBank.

1. GenBank’s web interface
This is the easiest way to download multiple sequences from NCBI GenBank if you have a range of accession numbers. It can be done in a few steps:

  • Go to the NCBI webpage
  • Choose the database (protein, nucleotide, EST, GSS)
  • Enter the range of accession numbers followed by the [accn] tag
For example, I have the accession number range EF100000 to EF102000, so my query is EF100000:EF102000[accn]. Click HERE for an actual example.
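
The same range query can also be sent to NCBI programmatically. As a rough sketch (this is not the script used in this post), the Python snippet below builds the range term and a URL for the esearch endpoint of NCBI's E-utilities. The helper names range_term and esearch_url are made up for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def range_term(first, last):
    """Return an accession-range term such as EF100000:EF102000[accn]."""
    return f"{first}:{last}[accn]"


def esearch_url(term, db="nucleotide", retmax=2000):
    """Build (but do not fetch) an esearch URL for the given term.

    urlencode percent-encodes ':', '[' and ']' so the term arrives intact.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    term = range_term("EF100000", "EF102000")
    print(term)
    print(esearch_url(term))
```

Opening the printed URL in a browser should return the matching record IDs as XML, which could then be passed on to efetch to download the sequences themselves.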



2. Command line option
If you prefer not to leave your desktop, the command line is made for you. Download the NCBI search Perl script and search the database right from your shell.


Script to download: ncbi_search.pl

Usage
perl ncbi_search.pl -q "EF100000:EF102000[accn]" -o results.txt -d nucleotide -r fasta -m 2000

With this command the Perl script searches the nucleotide database with the query EF100000:EF102000[accn] and saves up to 2000 sequences to the results.txt file in FASTA format.

Options
 -q [STRING]   : raw query text (Required)
 -o [FILE]     : output file to create (Required)
 -d [STRING]   : name of the NCBI database to search, such as 'nucleotide' or 'pubmed' (Required)
 -r [STRING]   : the type of information requested. For sequences, 'fasta' is often used.
 -m [INTEGER]  : the maximum number of records to return (Optional)
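
One thing to watch with -q: a query that contains spaces, brackets, or parentheses has to reach the script as a single argument, otherwise the shell splits or interprets it. The Python sketch below illustrates the idea with the standard shlex module. build_command is a hypothetical helper, the option names are the ones listed above, and the query is only an example.

```python
import shlex


def build_command(query, outfile, db="nucleotide", fmt="fasta", maximum=None):
    """Return the ncbi_search.pl call as a list with one item per argument."""
    cmd = ["perl", "ncbi_search.pl", "-q", query, "-o", outfile, "-d", db, "-r", fmt]
    if maximum is not None:
        cmd += ["-m", str(maximum)]
    return cmd


if __name__ == "__main__":
    query = "100:1000[SLEN] AND Exophiala dermatitidis[ORGN]"

    # Unquoted, a POSIX shell would split the query at every space.
    print(shlex.split(query))

    # shlex.join quotes each argument so the line can be pasted into a POSIX shell.
    print(shlex.join(build_command(query, "results.fas", maximum=500)))
```

If you launch the script from Python instead of a shell, passing the list straight to subprocess.run avoids the quoting question altogether, because each list item is handed over as exactly one argument.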

9 comments:

  1. Hi! I was wondering what types of query terms we can use with "ncbi_search.pl"? For example, would an NCBI web search query like this one work:

    bacteria AND 16S AND (cave OR karst OR aquifer OR groundwater OR mine OR lava) AND 750 : 2000[SLEN] NOT soil NOT river NOT coal NOT tailings NOT potassium NOT drainage NOT landfill


    I don't know if [] and () are acceptable.

    Thank you,

    Replies
    1. Hi,
      Nice question. I used this command
      perl ncbi_search.pl -q bacteria AND 16S AND (cave OR karst OR aquifer OR groundwater OR mine OR lava) AND 750 : 2000[SLEN] NOT soil NOT river NOT coal NOT tailings NOT potassium NOT drainage NOT landfill -o result.txt -d protein -m 10 -r fasta

      and I found this
      >gi|811383865|emb|CPN35748.1| oxidoreductase [Bordetella pertussis]
      MSRPVVLVTGASRGIGRAIAQRLLADGYDVVNFSRGKPADVLPGERFVPVDLSDTEAARRAATELAAQRE
      VLHLVNNAGLIEVAGIDQVAPDAMQRTLALNLVAPLVLLQALLPGMRARGYGRVVNIGSRAALGKPGRSA
      YGASKAGLAGMSRTWALELAPAGITVNVVAPGPIATELFNQSNPPGDPRTRQLEAAIPVGRVGRPDEVAH
      AVASLLDPRAGFITGQVLYVCGGMTV

      >gi|811383864|emb|CPN35718.1| putattive exported protein [Bordetella pertussis]
      MKRTPGQWRACRAAVRALIAACMATLAAQAGAAGYPEHPVTVVVPYPPGGAADIFGRAIAHAMQPHLKQT
      VLVENRPGAGGNVGMTYVTRSKPDGYTLGLGTIGTQSINQFLYADMPYDPGRDLVPVALVSTTPNVLAVS
      ARSPYRTLADVIEAARQRKDNKLTYASPGVGSSVHLAGAYFEAVAGISLLHVPFKGTSASLPAVAGGQVD
      LLFDNLPGALAQIKDGSLVRGIAITSAQRDPSVPDLPTFAEAGVAGFDVTAWFALYAPRGTPQPVAAARQ
      GLQAPELARQFGAMGARPGDKFGAQLGQFEQAERQKWGELIKQRGIRAQ

      >gi|811383863|emb|CPN35681.1| acyl-CoA synthetase [Bordetella pertussis]
      MDGSGAQATPGPAGLAAHALAALHGPAPLPAPNAGVLCFTTSGTTSLPKFVLHDQDTLLRHGDAIARSYG
      YDDDSRILASAPFCGAFGFATLVGALARGVPVICEPAFDAARSVAAVRRHRVTHTYANNEALVQMFRLGE
      RADFATARLFGFASFAPALGDLLPLARAQGVPLTGLYGSSELIALVAAQPREPADGDVSVRYEPGGALIH
      PEARVRARDPQDGRILTDGESGEIEILAPSLMRGYLDNPQASAGALTDDGYFRTGDLGYTLGTRQFVFQT
      RMGDSLRLSGFLVNPAEIEQAVEALPGIRACQVVGATRDGKTVPYAFVLLDAGASPDPPGWMAACRQGMA
      GFKVPAGFQVLEAFPVVESANSAKIQKHKLREQAEALLAAAPAA

      >gi|811383862|emb|CPN35646.1| transcriptional regulator [Bordetella pertussis]
      MSAARLFARSPQPLYLQAAAVFRGHIQTGVWRPGRQIPPLDALAARYGIARSTVRQALGLLEADGLIRRS
      RGSGTFVEDTLPETPTLLIPKNWAETVALSRQLGTVALHESSAQQPLPDDLGIPCDFARDGRFQYLRRLH
      TAAAGPFCFSEVYLESALFRKHRAEIRASTVAPVLDRHYRARLSHARQVLNVIEAGAASAQVLRIPVSAP
      VAELRRYVCIDARVVYFARLEFPFHKVRMEFDLLP

      >gi|811383861|emb|CPN35610.1| adenylyltransferase [Bordetella pertussis]
      MAPMNSQPSVIAPACEADAERRFGGLARLYGPDAPAALRGAHVAVAGLGGVGSWTAEALARCGVGALTLI
      DLDHIAESNVNRQIHALSDTLGQAKIEAMAQRIGQINPACAVTRVDEFVAPDNVMQVLGGPYAAIVDCTD
      QAAAKIAMILHARQRGVPLLLCGGAGGKTDPLALRAGDLSEAVNDALLAKLRNKLRREHGFPRASDANGK
      VRKRVPRMGVRALWFDQPAILPDAWTRAVEGEDDMGAAGTRAAPQGLSCAGYGSVVTVTAAMGLAAANEV
      LRWVVGKPGA

      >gi|811383860|emb|CPN35585.1| inner membrane efflux protein [Bordetella pertussis]
      MLNHMALTGGRITVSLTALKLGLSTFTVGMLVAVFAVLPMFASVHAGRWVDKIGVVRPLVIGSSLVTFGT
      ALPFVSQTQAALLVASCCIGIGFMLHQVATQDLLGHAEPRERLRNFSWMSLALAASGFSGPLIAGLAIDH
      LGTRLAFGMLALGPMLSLVGLYLLRQPLRTMNGALTGGSNRPAERRRITELLAVPPLRRILMVNTILSGA
      WDTHLFVVPIFGVAIGLSATTIGVILAAFAAATFVIRLVLPFIQTRVRSWTLVRAAMATAAIDFLLYPFF
      TDVGMLIGLSFVLGLALGCCQPSMLSLLHQYSPPGRAAEAVGLRMALINASQVSLPLTFGALGAVIGVAP
      LFWAYALALVAGGWANRNPPQESDSSKSS

      >gi|811383859|emb|CPN35555.1| multidrug efflux-associated protein [Bordetella pertussis]
      MNTQSPPSPHDPHGAAALPFRLWTAEERRASLDTALREWRDGEDVWVYGYGSLIWRPDFDFVERRLATLH
      GHHRALCLWSRVNRGTPECPGLVFGLDRGGSCRGVVYRLAGRQVPDYFPALWDREMSTGAYLPRWLRCAT
      EHGPVNALVFIMNRANPAYIRALPEPELLAIVRRASGRYGPCTEYVVQTAQALRQAGIRDARLEQIARQL
      EADVHPLGV

      >gi|811383858|emb|CPN35523.1| pyridoxamine 5'-phosphate oxidase [Bordetella pertussis]
      MSVSDLRQSYEKGVLVEEQAAASPFQQFARWFDEAVAARVPEPNAMTLATVNAEGQPSARIVLIKGYDDA
      GFVFFTNYESRKGLDLDANPRASLLFFWQPLERQVRIEGVIEKVSAAESDEYFHSRPLGSRLGAWASRQS
      QPITRDELEAREREFRDRYGEHPPRPPHWGGYRLKPNRFEFWQGRPSRLHDRLRYEPDGKQGWTIDRLSP

      >gi|811383857|emb|CPN35499.1| monooxygenase component [Bordetella pertussis]
      MSAAFPPAFDAAFFRTALGRFATGVTVATTAGPDGQPVGLTVSSFNSVSLNPPLILWSLARTSSSLAAFE
      RCQRYVVNVLSASQIALARRFATGKTPERFAGLTLAQAPAGTPMLGEGCAAWFECRNRSRYEEGDHIIMV
      GQVEHCGHSGVPPLVFHAGGFDLTPPHGGASS

      >gi|811383856|emb|CPN35452.1| oxidase [Bordetella pertussis]
      MSTDIDCIVIGAGVVGLAIARALAAGGHEVLVAEAAEGIGTGTSSRNSEVIHAGIYYPADSLKARLCVRG
      KHLLYEYCAARGVPHQRLGKLIVATSDAEASQLDSIARRAGANGVDDLQHIDGAAARRLEPALHCTAALV
      SPSTGIVDSHALMLAYQGDAESDGAQLVFHTPLIAGRVRPEGGFELDFGGAEPMTLSCRVLINAAGLHAP
      GLARRIEGIPRDSIPPEYLCKGSYFTLAGRAPFSRLIYPVPQHAGLGVHLTLDLGGQAKFGPDTEWIATE
      DYTLDPRRADVFYAAVRSYWPALPDGALAPGYTGIRPKISGPHEPAADFAIAGPASHGVAGLVNLYGIES
      PGLTASLAIAEETLARLAA



      So I think you can use this kind of command.

  3. Hi, Priyanka! First, thanks for the piece of code.

    I was just trying to use it, but I am hitting this error:
    "perl ncbi_search.pl -q Abarema -o results.txt -d nucleotid -r fasta
    Use of uninitialized value $esearch_result in pattern match (m//) at ncbi_search.pl line 75."

    Any thoughts on what could be the problem?

    Thanks!

    Leo

    Replies
    1. Hi Leo,
      Thanks for stopping by. Please make sure that you have installed BioPerl on your system. Also, the database name is 'nucleotide', not 'nucleotid'. I successfully ran the script using your command with that correction, and the result file starts like this:
      >KY046023.1 UNVERIFIED: Abarema turbinata isolate 150038132 tRNA-Lys (trnK) gene, partial sequence; and maturase K-like (matK) gene, complete sequence; chloroplast
      ATGGAGGAATTTCAAGTATATTTAGAACTAGATAGATCTCGTCAACATGACTTCCTATACCCACTTATTT
      TTCGGGAGTATCTTTTTGCACTTGCTTACGATCATGGTTTAAATAGTTCCATTTTGGTGCAAAATCTAGG
      TTATGACAATAAATCTAGTTTACTAATTGTAAAACGTTTAATTACTCGAATGTATCAACAGAATCATTTG



      Please let me know if you run into any problems. Thanks.


  4. No, it doesn't seem to work. With this query: perl ncbi_search.pl -q (100:1000[SLEN] AND "internal transcribed spacer 1" AND "Exophiala dermatitidis"[Organism]) -o results.fas -d nucleotide -r fasta -m 500 I always obtain the same 396 bacterial strains. I am sure bacteria don't have an ITS region, and Exophiala is definitely a fungus. The same query on the NCBI website gives the right sequences. I don't know what is going on, because the Perl script sends this same query to NCBI.

  5. Comment on my previous reply: the solution is found. I put the whole query in quotation marks: "100:1000[SLEN] AND internal transcribed spacer 1 AND Exophiala dermatitidis[ORGN]".

