PERL Script 7: How to Extract FASTA Sequence Headers
Suppose you have a text file with thousands of FASTA sequences and want to extract the FASTA headers (the description lines). Doing this by hand would be a really tedious job, but PERL offers a very easy solution. There are two possible scenarios. In the first, each FASTA header contains only an accession number and no description, like this:
HOW TO SAVE PERL SCRIPT
>Seq1
CACCATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTCTCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGACAACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTCCAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGCTCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Seq2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAGCGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACTACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGTTAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAGAGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGGAGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCTTACACGGAAATCAACGGCGGTGTCATAAGCGAG
In the second scenario, a description follows the accession number, like this:
>Seq1 long description
CACCATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTCTCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGACAACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTCCAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGCTCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Seq2 longer description
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAGCGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACTACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGTTAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAGAGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGGAGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCTTACACGGAAATCAACGGCGGTGTCATAAGCGAG
Thus, you may be interested in extracting either the accession number alone or the whole header, and the solution differs slightly for each. First, let's assume you want to extract the whole FASTA header. Your input file is named input.txt and you want to save the result in result.txt. Then you can use this PERL script:
Column.pl
#!/usr/bin/perl -w
use strict;
# Downloaded from http://www.bioinformatics-made-simple.com
# Write every FASTA header (the text after ">") to result.txt, one per line.
open (IN, "<", "input.txt") or die "Couldn't open input.txt: $!\n";
open (OUT, ">", "result.txt") or die "Couldn't open result.txt: $!\n";
while (<IN>) {
    if (/^>(.+)/) {
        print OUT "$1\n";
    }
}
close (IN);
close (OUT);
Column.pl writes one header per line. If you want the headers on a single tab-delimited line instead, you can use Tab.pl:
#!/usr/bin/perl -w
use strict;
# Downloaded from http://www.bioinformatics-made-simple.com
# Write every FASTA header to result.txt on one line, each followed by a tab.
open (IN, "<", "input.txt") or die "Couldn't open input.txt: $!\n";
open (OUT, ">", "result.txt") or die "Couldn't open result.txt: $!\n";
while (<IN>) {
    if (/^>(.+)/) {
        print OUT "$1\t";
    }
}
print OUT "\n";
close (IN);
close (OUT);
For the second example file above, the final output will look like this:
Column.pl output
Seq1 long description
Seq2 longer description
Tab.pl output
Seq1 long description	Seq2 longer description
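Note that Tab.pl writes a tab after every header, including the last one. The following is only a minimal sketch of one way to avoid that trailing tab: collect the headers in an array first and then join them. The helper name fasta_headers and the in-script example lines are assumptions made for illustration; they are not part of the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original scripts): return the header
# text (everything after ">") for each header line in a list of lines.
sub fasta_headers {
    my @lines = @_;
    my @headers;
    for my $line (@lines) {
        # \s*$ drops trailing whitespace, including "\n" or "\r\n"
        if ($line =~ /^>(.*?)\s*$/) {
            push @headers, $1;
        }
    }
    return @headers;
}

# Illustrative input; in practice the lines would come from input.txt.
my @example = (">Seq1 long description\n", "CACCATCC\n",
               ">Seq2 longer description\n", "CGAAACGG\n");

# join() places a tab only between headers, never after the last one.
print join("\t", fasta_headers(@example)), "\n";
```

Because the headers are held in memory before printing, this approach suits files whose header list fits comfortably in RAM; the streaming loop in Tab.pl does not have that limitation.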
Now suppose you want to extract only the accession numbers. In that case, use these PERL scripts:
Tab modified.pl
#!/usr/bin/perl -w
use strict;
# Downloaded from http://www.bioinformatics-made-simple.com
# Write only the accession (the first word after ">") of each header,
# all on one line, each followed by a tab.
open (IN, "<", "input.txt") or die "Couldn't open input.txt: $!\n";
open (OUT, ">", "result.txt") or die "Couldn't open result.txt: $!\n";
while (<IN>) {
    if (/^>(\S+)/) {
        print OUT "$1\t";
    }
}
print OUT "\n";
close (IN);
close (OUT);
Column1.pl
#!/usr/bin/perl -w
use strict;
# Downloaded from http://www.bioinformatics-made-simple.com
# Write only the accession (the first word after ">") of each header,
# one per line.
open (IN, "<", "input.txt") or die "Couldn't open input.txt: $!\n";
open (OUT, ">", "result.txt") or die "Couldn't open result.txt: $!\n";
while (<IN>) {
    if (/^>(\S+)/) {
        print OUT "$1\n";
    }
}
close (IN);
close (OUT);
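If you need both pieces at once, the accession and the description can be captured in a single regular expression. The sketch below is only an illustration under stated assumptions: split_header is a hypothetical helper name, the accession is taken to be the first run of non-whitespace characters after ">", and everything after it is treated as the description.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original scripts): split one FASTA
# header line into (accession, description). Returns an empty list for
# lines that are not headers.
sub split_header {
    my ($line) = @_;
    return unless $line =~ /^>(\S+)\s*(.*?)\s*$/;
    return ($1, $2);
}

# Illustrative lines: a bare accession, a full header, and a sequence line.
for my $line (">Seq1\n", ">Seq2 longer description\n", "CGAAACGG\n") {
    my ($acc, $desc) = split_header($line) or next;   # skip non-header lines
    print "$acc\t$desc\n";
}
```

The description comes back as an empty string when the header holds only an accession, so the same code handles both scenarios described earlier.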
I hope these PERL scripts will be useful for your sequence analysis studies. Alternatively, you can also try FaBox to extract FASTA headers from a large dataset.
Update
This script also extracts the FASTA headers from your sequence file. It prints each complete header line, including the leading ">", to the screen:
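#!/usr/bin/perl
use strict;
use warnings;
# Print every header line of input.txt to the screen.
open(FASTA, "<", "input.txt") or die "Couldn't open input.txt: $!\n";
while(<FASTA>) {
    chomp($_);
    if ($_ =~ m/^>/ ) {
        my $header = $_;
        print "$header\n";
    }
}
close(FASTA);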
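While learning regular expressions in PERL today, I wrote this script to extract FASTA headers. It asks for the name of your file and saves the header lines in result.txt. I hope it will also help someone.
#!/usr/bin/perl
# FASTA header extract
use strict;
use warnings;
print "Name of your file: ";
my $input = <STDIN>;
chomp($input);    # remove the newline, otherwise open() cannot find the file
open (INPUT, "<", $input) or die "Couldn't open $input: $!\n";
open (OUT, ">", "result.txt") or die "Couldn't open result.txt: $!\n";
while (my $line = <INPUT>) {
    if ($line =~ m/^>/) {    # header lines start with ">"
        print OUT $line;
    }
}
close (INPUT);
close (OUT);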
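As an alternative to typing the file name at a prompt, the file names can be passed on the command line. The following is a hedged sketch, not a replacement for the script above: the helper name copy_headers and the usage line are assumptions made for this example, and the script does nothing unless exactly two arguments are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original scripts): copy every header
# line from one filehandle to another and return how many were found.
sub copy_headers {
    my ($in_fh, $out_fh) = @_;
    my $count = 0;
    while (my $line = <$in_fh>) {
        next unless $line =~ /^>/;   # keep only header lines
        $line =~ s/\s+$//;           # strip newline and trailing whitespace
        print {$out_fh} "$line\n";
        $count++;
    }
    return $count;
}

# Assumed usage: perl headers.pl input.txt result.txt
if (@ARGV == 2) {
    my ($in_path, $out_path) = @ARGV;
    open(my $in,  '<', $in_path)  or die "Couldn't open $in_path: $!\n";
    open(my $out, '>', $out_path) or die "Couldn't open $out_path: $!\n";
    my $n = copy_headers($in, $out);
    close($out) or die "Couldn't close $out_path: $!\n";
    print "Wrote $n headers to $out_path\n";
}
```

Passing filehandles rather than file names to the helper keeps it easy to reuse, for example with STDIN or with a file that is already open.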
Hi,
I tried to use this script, but it does not seem to work properly. This is the error message: "Can't locate Bio/SeqIO.pm". Can you help me?
Many thanks
Hi franz,
All the perl scripts given in this post work perfectly. Are you sure the problem is with a perl script from this page? The error you posted is related to BioPerl, and none of these perl scripts depend on BioPerl. So please check again.