How to Extract Multiple Sequence from Multi Fasta File with PERL~ Bioinformatics Made Simple.com

How to Extract Multiple Sequence from Multi Fasta File with PERL

Yesterday, we talked about NCBI BLAST parsing. I shared two PERL script than can parsed BLAST results in tabulated manner so that you can access your NCBI BLAST output easily. Finally you can extract out those queries without hits or with hits. By the help of those PERL script you can find you best hits also. Suppose you have identified those queries which have no hits in BLAST and now want to remove those queried from main FASTA sequence file then it would not be easy if number of those queries are high. In that case, this PERL script can save your day. this PERL codelet was written by my friend HARRY

Extract Seq.pl

#!/usr/local/bin/perl

print "Enter Input code file name.\n";

$f=<STDIN>;
$f=~s/\n//g;
#f="test.out";

print "Enter Input Sequence file name.\n";

$fs=<STDIN>;
$fs=~s/\n//g;

print "Output codefile name.\n";

$fw=<STDIN>;
$fw=~s/\n//g;
$flag=0;

print "Output mainfile name.\n";

$fw1=<STDIN>;
$fw1=~s/\n//g;
$flag1=0;

unless (open(DA,$f))
{
  print "Input file is not open.\n";
}
@da=<DA>;
close DA;

unless (open(DATA,$fs))
{
  print "Input file is not open.\n";
}
@daseq=<DATA>;
close DATA;

unless (open(WR,">$fw"))
{
  print "Output file is not open.\n";
}

unless (open(WR1,">$fw1"))
{
  print "Output file is not open.\n";
}

foreach $fl(@daseq)
{
  $ID="ZZZZZZZZZZZZZZZZZZZZZZZZ";
  
  if($fl=~/^\>(.........)/)
  {
    $ID=$1;
    
  foreach $line(@da)
  {

    if($line=~/(.........)/)
    {
      if($ID eq $1)
      {
        $flag1=1;
      }
    }
  }
  
  }
  
  if($flag1==1)
  {
    print WR "$fl";
    $flag++;
  }
  else
  {
    print WR1 "$fl";
  }
  
  if($flag==2)
  {
    $flag=0;
    $flag1=0;
  }
}


close WR;
close WR1;

Uses

How to run PERL script

it is very easy to use this PERL script. Just run the script from your command prompt, it will ask four file name -

File name with accession number
File with FASTA sequences
File name where you want to write the FASTA sequences of your queries
File name where you would like to write rest of sequences

Limitations

I run this PERL script using these kind of accession numbers but hope it will work for other kind of queries also

>Vocar20001498m
Brachypodium distachyon
AT1G51370

Instead of being a wonderful tool, this script has one limitation. You sequence must be in single line FASTA file format not in multi - line FASTA format. But it should not be a problem as I have already shared a PERL script to convert multi - line FASTA format to single line FASTA format.

>right format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVES
>wrong format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSAL
STKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKV
SHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGL
DLDPYASSNTNTIVSFVES

Hope this Bioinformatics tutorial would be useful to extract out the sequence from your multi-fasta files in sequence analysis studies.

7 comments:

AnonymousDecember 5, 2012 at 1:09 PM
HI,
It's not working for sequence headers like:

>lcl|FR796397.1_cdsid_CBZ11804.1 [gene=LMJF_01_0010] [protein=hypothetical protein] [protein_id=CBZ11804.1] [location=complement(3704..4702)]

Please help.
ReplyDelete
Replies
Joe CatholicApril 19, 2013 at 2:40 PM
Thanks a lot for this code, I was struggling to extract a subset of protein annotations from a predicted aa file.
ReplyDelete
Replies

Have Problem ?? Drop a comments here!

Pages