Perl Script 7: How to Extract FASTA Sequence Headers

Suppose you have a text file with thousands of FASTA sequences and want to extract the FASTA headers (the description lines). Doing this by hand would be a really tedious job, but Perl offers a very easy solution. There are two possible scenarios. In the first, your FASTA headers contain only an accession number, with no description, like this:
>Seq1
CACCATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTCTCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGACAACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTCCAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGCTCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Seq2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAGCGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACTACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGTTAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAGAGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGGAGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCTTACACGGAAATCAACGGCGGTGTCATAAGCGAG
In the second scenario, there is a description after the accession number, like this:
>Seq1 long description
CACCATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTCTCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGACAACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTCCAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGCTCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Seq2 longer description
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAGCGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACTACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGTTAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAGAGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGGAGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCTTACACGGAAATCAACGGCGGTGTCATAAGCGAG

Thus, you may be interested in extracting either the accession number alone or the whole header, and the solution differs accordingly. First, let's assume you want to extract the whole FASTA header. Your input file is named input.txt and you want to save the result in result.txt. Then you can use this Perl script:

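Column.pl
#!/usr/bin/perl
use strict;
use warnings;
# Downloaded from http://www.bioinformatics-made-simple.com

# Read FASTA records from input.txt and write one header per line to result.txt
open(my $in,  '<', 'input.txt')  or die "Couldn't open input.txt: $!\n";
open(my $out, '>', 'result.txt') or die "Couldn't open result.txt: $!\n";

while (my $line = <$in>) {
    # A header line starts with '>'; capture everything after it
    if ($line =~ /^>(.+)/) {
        print $out "$1\n";
    }
}

close($in);
close($out);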

The output of Column.pl is a single column, one header per line. But what if you want the result in tab-delimited format? Then you can use Tab.pl for this purpose:

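#!/usr/bin/perl
use strict;
use warnings;
# Downloaded from http://www.bioinformatics-made-simple.com

open(my $in,  '<', 'input.txt')  or die "Couldn't open input.txt: $!\n";
open(my $out, '>', 'result.txt') or die "Couldn't open result.txt: $!\n";

# Collect every header first, then join with tabs so the line has no trailing tab
my @headers;
while (my $line = <$in>) {
    push @headers, $1 if $line =~ /^>(.+)/;
}

print $out join("\t", @headers), "\n";

close($in);
close($out);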

For the second example file above, the final output will look like this:

Column.pl output

Seq1 long description
Seq2 longer description

Tab.pl output

Seq1 long description        Seq2 longer description
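
The two layouts differ only in the separator written between the headers. The following is a minimal sketch of that idea, not a replacement for the scripts in this post. The helper name extract_headers is hypothetical, and the sample data is held in a string (an assumption made so the sketch runs without input.txt).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every FASTA header (without the leading '>')
# found on the given filehandle.
sub extract_headers {
    my ($fh) = @_;
    my @headers;
    while (my $line = <$fh>) {
        push @headers, $1 if $line =~ /^>(.+)/;
    }
    return @headers;
}

# Sample data kept in a string so the sketch runs without input.txt.
my $fasta = ">Seq1 long description\nCACCATCC\n"
          . ">Seq2 longer description\nCGAAACGG\n";

open(my $in, '<', \$fasta) or die "Couldn't open in-memory sample: $!\n";
my @headers = extract_headers($in);
close($in);

my $column = join("\n", @headers) . "\n";   # one header per line
my $tabbed = join("\t", @headers) . "\n";   # all headers on one line

print $column, $tabbed;
```

To use it on a real file, open input.txt instead of the in-memory string and print $column or $tabbed to your output filehandle.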

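Now suppose you want to extract only the accession numbers. In that case, use the following Perl scripts. The first writes the accession numbers tab-delimited on one line, the second writes one per line.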
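Tab modified.pl
#!/usr/bin/perl
use strict;
use warnings;
# Downloaded from http://www.bioinformatics-made-simple.com

open(my $in,  '<', 'input.txt')  or die "Couldn't open input.txt: $!\n";
open(my $out, '>', 'result.txt') or die "Couldn't open result.txt: $!\n";

# The accession number is the run of non-space characters right after '>'
my @accessions;
while (my $line = <$in>) {
    push @accessions, $1 if $line =~ /^>(\S+)/;
}

print $out join("\t", @accessions), "\n";

close($in);
close($out);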

Column1.pl
#!/usr/bin/perl
use strict;
use warnings;
# Downloaded from http://www.bioinformatics-made-simple.com

open(my $in,  '<', 'input.txt')  or die "Couldn't open input.txt: $!\n";
open(my $out, '>', 'result.txt') or die "Couldn't open result.txt: $!\n";

while (my $line = <$in>) {
    # Capture only the non-space characters after '>' (the accession number)
    if ($line =~ /^>(\S+)/) {
        print $out "$1\n";
    }
}

close($in);
close($out);
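
If you need both parts of the header, the accession number and the description can be captured in a single match. This is only a sketch under assumptions: the helper name parse_header is hypothetical, and the accession number is taken to be everything up to the first whitespace character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one header line into (accession, description).
# Returns an empty list when the line is not a FASTA header.
sub parse_header {
    my ($line) = @_;
    return unless $line =~ /^>(\S+)\s*(.*)/;
    return ($1, $2);
}

# Two header lines (with and without a description) and one sequence line.
for my $line (">Seq1 long description\n", ">Seq2\n", "CACCATCC\n") {
    my ($acc, $desc) = parse_header($line) or next;   # skip non-header lines
    print "$acc\t$desc\n";
}
```

A header without a description, such as >Seq2, gives an empty description string rather than an error, so the same code covers both scenarios described at the top of this post.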

Hope these Perl scripts will be useful for sequence analysis studies. Alternatively, you can also try FaBox to extract FASTA headers from a large dataset.

Update


This script can also help you extract the FASTA headers from your sequence file. It prints the header lines to the screen instead of saving them to a file.

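#!/usr/bin/perl
use strict;
use warnings;

open(my $fasta, '<', 'input.txt') or die "Couldn't open input.txt: $!\n";

while (my $line = <$fasta>) {
    chomp($line);
    # Print every line that starts with '>', i.e. every header line
    if ($line =~ m/^>/) {
        print "$line\n";
    }
}

close($fasta);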
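
One thing to watch for: with the default settings, chomp removes only the newline character, so a FASTA file saved with Windows line endings and read on Linux or macOS keeps a carriage return at the end of each header. The sketch below shows one way to strip either kind of line ending. The helper name strip_eol and the in-memory sample are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing Unix (\n) or Windows (\r\n) line ending.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Sample data with Windows line endings, kept in a string so no file is needed.
my $fasta = ">Seq1 long description\r\nCACCATCC\r\n>Seq2\r\nCGAAACGG\r\n";

open(my $in, '<', \$fasta) or die "Couldn't open in-memory sample: $!\n";
while (my $line = <$in>) {
    $line = strip_eol($line);
    print "$line\n" if $line =~ /^>/;   # header lines only
}
close($in);
```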

While learning regular expressions in Perl today, I wrote this script to extract FASTA headers. It asks for the name of your file and saves the header lines, including the leading '>', to result.txt. Hope this will also help someone.
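#!/usr/bin/perl
use strict;
use warnings;

# FASTA header extract
print "Name of your file: ";

# Read the file name from the keyboard and drop the trailing newline
my $input = <STDIN>;
chomp($input);

open(my $in,  '<', $input)       or die "Couldn't open $input: $!\n";
open(my $out, '>', 'result.txt') or die "Couldn't open result.txt: $!\n";

while (my $line = <$in>) {
    # Header lines begin with '>'; write them out unchanged
    if ($line =~ m/^>/) {
        print $out $line;
    }
}

close($in);
close($out);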

2 comments:

  1. Hi,
    I tried to use this script, but it does not seem to work properly. This is the error message: "Can't locate Bio/SeqIO.pm". Can you help me?
    Many Thanks

    1. Hi franz,
      All the Perl scripts given in this post work perfectly. Are you sure the problem is with a script from this page? The error you posted is related to BioPerl, and none of these scripts depends on BioPerl. So please check it again.
