How to Remove Duplicate Sequences from a Multi-FASTA File ~ Bioinformatics Made Simple.com

How to Remove Duplicate Sequences from a Multi-FASTA File

One of the most common problems in sequence analysis is the presence of duplicate sequences in data sets. Even protein or nucleotide sequences downloaded from NCBI or similar databases may contain duplicate entries. Some FASTA files contain entries with different IDs but identical sequences, and such duplicates may lead to incorrect results. So let's look at some utilities for removing duplicate sequences from a multi-FASTA file.
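To see whether a file actually has this problem, it helps to list which IDs share the same sequence. The following is a minimal Python sketch using only the standard library; it is not one of the scripts described in this post. The function names `parse_fasta` and `duplicate_groups` are made up for illustration, and the sketch assumes that two entries count as duplicates only when their sequences match exactly, ignoring upper/lower case.

```python
from collections import defaultdict
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open multi-FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def duplicate_groups(handle):
    """Return lists of IDs whose sequences are identical (case-insensitive)."""
    ids_by_sequence = defaultdict(list)
    for header, sequence in parse_fasta(handle):
        # Assumption: the ID is the first whitespace-separated word of the header.
        seq_id = header.split()[0] if header.strip() else "(unnamed)"
        ids_by_sequence[sequence.upper()].append(seq_id)
    return [ids for ids in ids_by_sequence.values() if len(ids) > 1]


# Toy data: seq1 and seq3 have different IDs but the same sequence.
example = StringIO(
    ">seq1 first copy\nATGCATGC\n"
    ">seq2\nGGGTTT\n"
    ">seq3 same as seq1\nATGC\nATGC\n"
)
print(duplicate_groups(example))  # [['seq1', 'seq3']]
```

Grouping by the sequence itself rather than by ID is what catches entries that were submitted under different names.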

1. RemoveRep.pl

Dependencies

Bio::Perl
Bio::SeqIO

Usage

removerep.pl input.txt output.txt
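If the Perl script or its dependencies are not available, the same input-file/output-file idea can be sketched in plain Python. This is an illustrative stand-in, not the code of RemoveRep.pl, whose exact behavior is not shown here. The names `read_fasta` and `remove_duplicates` are hypothetical, and the sketch assumes that the first entry with a given sequence is kept, that comparison ignores case, and that output lines are wrapped at 60 characters.

```python
import sys


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(in_path, out_path, width=60):
    """Write each distinct sequence once; return (total, kept) counts."""
    seen = set()
    total = kept = 0
    with open(out_path, "w") as out:
        for header, sequence in read_fasta(in_path):
            total += 1
            key = sequence.upper()  # assumption: case does not matter
            if key in seen:
                continue  # a later copy of an already written sequence
            seen.add(key)
            kept += 1
            out.write(f">{header}\n")
            for start in range(0, len(sequence), width):
                out.write(sequence[start:start + width] + "\n")
    return total, kept


# Runs only when called with exactly two arguments: input and output file.
if __name__ == "__main__" and len(sys.argv) == 3:
    total, kept = remove_duplicates(sys.argv[1], sys.argv[2])
    print(f"Read {total} sequences, wrote {kept} unique sequences")
```

Saved under a name of your choice, for example dedup.py, it would be run as `python dedup.py input.txt output.txt`.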

2. RemoveRep2.pl

Usage

Your input sequences should be in input.txt.

3. Other options

Here are some other free bioinformatics programs you may want to try for removing duplicate or redundant sequences from your datasets.

I. CD-HIT

II. DNA Baser

III. Duplicates Finder

3 comments:

Anonymous, July 3, 2013 at 6:40 AM
Thanks for this, it worked perfectly to remove duplicated sequences from a combined NCBI reference database.

Valeria, March 3, 2017 at 4:36 AM
The link to download is not working.
