How to Filter the Sequence by Their Length

PERL scripts are most wonderful stuff because they can help you in almost all aspect of Bioinformatics. PERL script can perform BLAST, parsing, gene prediction and many more for you in very simple way even without going to a particular website. Here let me share another beautiful work of Dr. David Rosenkranz, Institue of Anthropology Johannes Gutenberg University Mainz, Germany. Suppose you have thousand of genes in a file in FASTA format and want to delete or save (in another file) those sequences which are shorter than 300 bp or amino acid then this PERL script will save your life from editing file manually. 

  1. PERL Script
  2. Options
  3. Uses

Script name Download
Seq_filter.pl

This PERL script will generate three files
sequence length ok - Sequence with defined length
sequence too short - Sequence shorter than defined length
sequence too long - Sequence longer than defined length
Options

-i   = File name with sequences
-min = Minimum length of sequences
-max = Maximum length of sequences
-0   = Remove sequences and write to output file sequences_too_long.fas
-1   = Cut the end of the sequence and save to file sequences_ok.fas
-I   = File containing ist of files with sequences
Uses
If my sequences are stored in INPUT.TXT and I want to extract sequences with minimum length 30 and maximum length 60 then my command will be
perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60

If I want to cut the end of the sequence and save to file sequences_ok.fas fiile then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -1

If I want to remove sequences and write to output file sequences_too_long.fas then my command will be

perl Seq_filter.pl -i INPUT.TXT -min 30 -max 60 -0

If I have so many file with thousand of sequences, then I will write the name of those file(one
file name per line) in a text file (file_name.txt, for example then my command will be

perl length_cutoff.pl -I file_name.txt -min 30 -max 60
I can use all command together also like this

perl Seq_filter.pl -i INPUT.TXT -I file_name.txt -min 30 -max 60 -1


Hope this Bioinformatics tutorial would be useful to extract out the sequence from your multi-fasta files in sequence analysis studies.

Update

Yoy can also use this perl script to remove FASTA sequences shorter that your specified number
#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen);
    }
    local $/="\n";
}

Uses
perl remove_small.pl 200 input.fasta > result.fasta
Here 200 is the length of sequence you want to keep. So all fasta sequences less than 200 in length will be removed from input.fasta and saved into result.fasta.

29 comments:

  1. Hi,
    When I run this script, it showed "Search pattern not terminated at Seq_filter.pl line 166." I've been trying to debug this search pattern not terminated error, but cannot find where in? The last line at the bottom of is line 160. Can anyone give me any hints on what is wrong? THX.

    ReplyDelete
  2. Replies
    1. Please download the PERL scripts and use. It will work.

      Delete
  3. Hello,

    I'm having trouble with my input text. For some reason, my sequences keep getting split up in multiple segments, even though it's in fasta format.

    Thank you for your advice/help

    ReplyDelete
  4. Hello,

    When I try to run this script on a seqdump.text file which I got from a BlastP-search the script finds 16929 too short sequences, 0 too long sequences and 0 right sized sequences, while the the seqdump.txt file contains only 1913 sequences ?!?

    ReplyDelete
  5. Hi,
    This PERL script need single line fasta as input. Since your seqdump.text file is multiline fasta (same sequence is in multiple line) that's why it's not working. You can convert your multi line fasta file to single line fasta using this perl script

    ReplyDelete
  6. Thanks so much, this script is great!!!

    I have a question: I try to search SNPs from data meta-assambly de novo and the contigs.fasta is my genome of reference. The firts step is to do blastn of Contigs for find the contigs of interest, but I do not know what is the minimal lenght of contigs, Do you have any idea?

    Thanks

    ReplyDelete
  7. Great tool, thanks,

    Is there a way I can produce an output with the prefix of the input file?

    Thant wold make it perfect.

    ReplyDelete
  8. Hi

    Great tool

    Is there a way you can make an output with prefix of the input file?
    it'll make it perfect.

    ReplyDelete
  9. I get the following error
    env: perl\r: No such file or directory
    any ideas?

    ReplyDelete
    Replies
    1. I am not sure but do you set the environmental variable for PERL on your machine?

      Delete
  10. How can I install Seq_filter.pl on windows 8.

    ReplyDelete
  11. How can I install Seq_filter.pl on windows 8.

    ReplyDelete
  12. Hi Mack,
    This is a PERL script so you don't have to install it.
    All you need to install the Active perl on your computer and run it through your window shell

    Easiest Way to Run Perl Script on Windows

    ReplyDelete
  13. Hi the sequencesfilter.pl does not work for my files.
    The removesmall.pl is very good and i have run it with my fasta files (Assembled transcriptomes) removing contigs less than 300 and the output files do not have these contigs.

    Is there a way to print an output file for contigs with a minimum and maximum lenght for example from 300 to 600 bp using the removesmall. pl. I need to do this task to separate (filter) my fasta files by contig length.
    Thanks for your help

    ReplyDelete
    Replies
    1. Hi,
      Seq_filter.pl will do it for you. Please make sure that you have installed the bioperl on your machine.

      Delete
  14. Hi the filtersequences.pl did not work with my fasta files (assembled transcriptomes). The removesmall.pl is really good and I have run it with my fasta files removing contigs less than 300 and the output file does not have these contigs.

    Is there a way to print an output file for contigs with a minimum and maximum lenght for example from 300 to 600 bp using the removesmall.pl script. I need to do this task to separate (filter) my fasta files by contig length.

    Thanks for your help

    ReplyDelete
    Replies
    1. Hi,
      As I stated above, you can use the script like this
      perl remove_small.pl 200 input.fasta >result.fasta

      You result should be in the result.fasta

      Delete
  15. Hi,
    Sorry but I'm really new to bioinformatics. I really want to know how to run this perl script for all of my 40 samples at the same time. I try to correct the name of the output file with the prefix of my input. It's ok now. I try to write a loop in linux but I have always an erro message: "missing $ on loop variable at..."
    Thank you so much for your response

    ReplyDelete
  16. hi.. the script works great.. how can i split into fasta files while renaming the file with sequence name?

    ReplyDelete

Have Problem ?? Drop a comments here!