How to Extract Multiple Sequence from Multi Fasta File with PERL

Yesterday, we talked about NCBI BLAST parsing. I shared two PERL script than can parsed BLAST results in tabulated manner so that you can access your NCBI BLAST output easily. Finally you can extract out those queries without hits or with hits. By the help of those PERL script you can find you best hits also. Suppose you have identified those queries which have no hits in BLAST and now want to remove those queried from main FASTA sequence file then it would not be easy if number of those queries are high. In that case, this PERL script can save your day. this PERL codelet was written by my friend HARRY



Extract Seq.pl
#!/usr/local/bin/perl

print "Enter Input code file name.\n";

$f=<STDIN>;
$f=~s/\n//g;
#f="test.out";

print "Enter Input Sequence file name.\n";

$fs=<STDIN>;
$fs=~s/\n//g;

print "Output codefile name.\n";

$fw=<STDIN>;
$fw=~s/\n//g;
$flag=0;

print "Output mainfile name.\n";

$fw1=<STDIN>;
$fw1=~s/\n//g;
$flag1=0;

unless (open(DA,$f))
{
  print "Input file is not open.\n";
}
@da=<DA>;
close DA;

unless (open(DATA,$fs))
{
  print "Input file is not open.\n";
}
@daseq=<DATA>;
close DATA;

unless (open(WR,">$fw"))
{
  print "Output file is not open.\n";
}

unless (open(WR1,">$fw1"))
{
  print "Output file is not open.\n";
}

foreach $fl(@daseq)
{
  $ID="ZZZZZZZZZZZZZZZZZZZZZZZZ";
  
  if($fl=~/^\>(.........)/)
  {
    $ID=$1;
    
  foreach $line(@da)
  {

    if($line=~/(.........)/)
    {
      if($ID eq $1)
      {
        $flag1=1;
      }
    }
  }
  
  }
  
  if($flag1==1)
  {
    print WR "$fl";
    $flag++;
  }
  else
  {
    print WR1 "$fl";
  }
  
  if($flag==2)
  {
    $flag=0;
    $flag1=0;
  }
}


close WR;
close WR1;

Uses
it is very easy to use this PERL script. Just run the script from your command prompt, it will ask four file name -
  • File name with accession number
  • File with FASTA sequences
  • File name where you want to write the FASTA sequences of your queries
  • File name where you would like to write rest of sequences
Limitations
I run this PERL script using these kind of accession numbers but hope it will work for other kind of queries also
>Vocar20001498m
Brachypodium distachyon
AT1G51370

Instead of being a wonderful tool, this script has one limitation. You sequence must be in single line FASTA file format not in multi - line FASTA format. But it should not be a problem as I have already shared a PERL script  to convert  multi - line FASTA format to single line FASTA format.

>right format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGLDLDPYASSNTNTIVSFVES
>wrong format
MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDSVRTSAL
STKWRYLWQSVPGLDLDPYASSNTNTIVSFVESMVGGKKKTKICDKV
SHEEDRISQLPEPLISEILFHLSTKDSVRTSALSTKWRYLWQSVPGL
DLDPYASSNTNTIVSFVES
Hope this Bioinformatics tutorial would be useful to extract out the sequence from your multi-fasta files in sequence analysis studies.

7 comments:

  1. HI,
    It's not working for sequence headers like:

    >lcl|FR796397.1_cdsid_CBZ11804.1 [gene=LMJF_01_0010] [protein=hypothetical protein] [protein_id=CBZ11804.1] [location=complement(3704..4702)]


    Please help.

    ReplyDelete
    Replies
    1. try to simplify your FASTA header

      Delete
    2. Thanks much for the direction ...

      Do you know any code that will shorten the fasta header of a multi-fasta file...

      Please ... need your help .. I am new to perl.

      Thanks in advance
      NIL

      Delete
  2. Thanks a lot for this code, I was struggling to extract a subset of protein annotations from a predicted aa file.

    ReplyDelete

Have Problem ?? Drop a comments here!