How to Filter the Sequence by Their Length~ Bioinformatics Made Simple.com

How to Filter the Sequence by Their Length

PERL scripts are most wonderful stuff because they can help you in almost all aspect of Bioinformatics. PERL script can perform BLAST, parsing, gene prediction and many more for you in very simple way even without going to a particular website. Here let me share another beautiful work of Dr. David Rosenkranz, Institue of Anthropology Johannes Gutenberg University Mainz, Germany. Suppose you have thousand of genes in a file in FASTA format and want to delete or save (in another file) those sequences which are shorter than 300 bp or amino acid then this PERL script will save your life from editing file manually.

PERL Script
Options
Uses

Script name	Download
Seq_filter.pl

This PERL script will generate three files
sequence length ok - Sequence with defined length
sequence too short - Sequence shorter than defined length
sequence too long - Sequence longer than defined length

Options


-i   = File name with sequences
-min = Minimum length of sequences
-max = Maximum length of sequences
-0   = Remove sequences and write to output file sequences_too_long.fas
-1   = Cut the end of the sequence and save to file sequences_ok.fas
-I   = File containing ist of files with sequences

Uses
If my sequences are stored in INPUT.TXT and I want to extract sequences with minimum length 30 and maximum length 60 then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60

If I want to cut the end of the sequence and save to file sequences_ok.fas fiile then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -1

If I want to remove sequences and write to output file sequences_too_long.fas then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -0

If I have so many file with thousand of sequences, then I will write the name of those file(one
file name per line) in a text file (file_name.txt, for example then my command will be

perl length_cutoff.pl -I file_name.txt -min 30 -max 60

I can use all command together also like this

perl Seq_filter.pl -i INPUT.TXT -I file_name.txt -min 30 -max 60 -1

Hope this Bioinformatics tutorial would be useful to extract out the sequence from your multi-fasta files in sequence analysis studies.

Update

Yoy can also use this perl script to remove FASTA sequences shorter that your specified number

#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen);
    }
    local $/="\n";
}

Uses

perl remove_small.pl 200 input.fasta > result.fasta

Here 200 is the length of sequence you want to keep. So all fasta sequences less than 200 in length will be removed from input.fasta and saved into result.fasta.

29 comments:

UnknownDecember 3, 2012 at 3:58 PM
Hi,
When I run this script, it showed "Search pattern not terminated at Seq_filter.pl line 166." I've been trying to debug this search pattern not terminated error, but cannot find where in? The last line at the bottom of is line 160. Can anyone give me any hints on what is wrong? THX.
ReplyDelete
Replies
hhebronMarch 12, 2013 at 3:51 AM
can not get script to work
ReplyDelete
Replies
WobuywMay 22, 2013 at 1:26 PM
Hello,

I'm having trouble with my input text. For some reason, my sequences keep getting split up in multiple segments, even though it's in fasta format.

Thank you for your advice/help
ReplyDelete
Replies
AnonymousJune 16, 2013 at 3:46 PM
Hello,

When I try to run this script on a seqdump.text file which I got from a BlastP-search the script finds 16929 too short sequences, 0 too long sequences and 0 right sized sequences, while the the seqdump.txt file contains only 1913 sequences ?!?
ReplyDelete
Replies
Priyanka PaulJune 17, 2013 at 10:06 PM
Hi,
This PERL script need single line fasta as input. Since your seqdump.text file is multiline fasta (same sequence is in multiple line) that's why it's not working. You can convert your multi line fasta file to single line fasta using this perl script
ReplyDelete
Replies
AnonymousJuly 10, 2013 at 1:56 AM
Thanks so much, this script is great!!!

I have a question: I try to search SNPs from data meta-assambly de novo and the contigs.fasta is my genome of reference. The firts step is to do blastn of Contigs for find the contigs of interest, but I do not know what is the minimal lenght of contigs, Do you have any idea?

Thanks

ReplyDelete
Replies
Priyanka PaulJuly 10, 2013 at 3:55 AM
Follow this bioinformatics tutorial

How to Split Multi line FASTA Files with PERL Script
ReplyDelete
Replies
UnknownNovember 12, 2013 at 1:45 AM
Great tool, thanks,

Is there a way I can produce an output with the prefix of the input file?

Thant wold make it perfect.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 1:46 AM
Hi

Great tool

Is there a way you can make an output with prefix of the input file?
it'll make it perfect.
ReplyDelete
Replies
UnknownDecember 19, 2013 at 8:29 AM
Nice tool, works well, thanks!
ReplyDelete
Replies
AnonymousJanuary 10, 2014 at 7:29 AM
I get the following error
env: perl\r: No such file or directory
any ideas?
ReplyDelete
Replies
UnknownJanuary 27, 2014 at 3:46 PM
How can I install Seq_filter.pl on windows 8.
ReplyDelete
Replies
UnknownJanuary 27, 2014 at 3:48 PM
No, I do not know how to do it.
ReplyDelete
Replies
UnknownJanuary 27, 2014 at 3:51 PM
How can I install Seq_filter.pl on windows 8.
ReplyDelete
Replies
Priyanka PaulJanuary 29, 2014 at 8:57 AM
Hi Mack,
This is a PERL script so you don't have to install it.
All you need to install the Active perl on your computer and run it through your window shell

Easiest Way to Run Perl Script on Windows
ReplyDelete
Replies
JosephJune 15, 2016 at 7:47 PM
Hi the sequencesfilter.pl does not work for my files.
The removesmall.pl is very good and i have run it with my fasta files (Assembled transcriptomes) removing contigs less than 300 and the output files do not have these contigs.

Is there a way to print an output file for contigs with a minimum and maximum lenght for example from 300 to 600 bp using the removesmall. pl. I need to do this task to separate (filter) my fasta files by contig length.
Thanks for your help

ReplyDelete
Replies
JosephJune 15, 2016 at 7:50 PM
Hi the filtersequences.pl did not work with my fasta files (assembled transcriptomes). The removesmall.pl is really good and I have run it with my fasta files removing contigs less than 300 and the output file does not have these contigs.

Is there a way to print an output file for contigs with a minimum and maximum lenght for example from 300 to 600 bp using the removesmall.pl script. I need to do this task to separate (filter) my fasta files by contig length.

Thanks for your help

ReplyDelete
Replies
UnknownSeptember 15, 2016 at 6:16 PM
Hi,
Sorry but I'm really new to bioinformatics. I really want to know how to run this perl script for all of my 40 samples at the same time. I try to correct the name of the output file with the prefix of my input. It's ok now. I try to write a loop in linux but I have always an erro message: "missing $ on loop variable at..."
Thank you so much for your response
ReplyDelete
Replies
sakinahMarch 4, 2019 at 1:44 PM
hi.. the script works great.. how can i split into fasta files while renaming the file with sequence name?
ReplyDelete
Replies

Add comment

Have Problem ?? Drop a comments here!

Pages