How to Remove Duplicate Sequences from a Multi-FASTA File
One of the most common problems in sequence analysis is the presence of duplicate sequences in data sets. Even protein or nucleotide sequences downloaded from NCBI or similar databases may contain duplicate entries. Some FASTA files contain records with different IDs that nonetheless have the same sequence, and such duplicate entries may lead to incorrect results. So let's look at some utilities for removing duplicate sequences from a multi-FASTA file.
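To make the problem concrete, here is a small Python sketch (not part of the original scripts) that reports which record IDs share an identical sequence. The helper name `find_duplicate_groups` and the sample records are made up for illustration. It assumes that duplicates are exact matches once case and line breaks are ignored, and that `>` never appears inside a header line.

```python
from collections import defaultdict


def find_duplicate_groups(fasta_text):
    """Return lists of record IDs that share an identical sequence."""
    ids_by_seq = defaultdict(list)
    # Each record starts with '>', so everything before the first '>' is skipped.
    for block in fasta_text.split(">")[1:]:
        header, _, body = block.partition("\n")
        # Join wrapped sequence lines and normalise case before comparing.
        seq = "".join(body.split()).upper()
        fields = header.split()
        record_id = fields[0] if fields else ""
        ids_by_seq[seq].append(record_id)
    # Only sequences seen under more than one ID count as duplicates.
    return [ids for ids in ids_by_seq.values() if len(ids) > 1]


# Hypothetical sample: seq1 and seq3 differ in ID, case and line wrapping only.
sample = (
    ">seq1 first copy\nACGT\nACGT\n"
    ">seq2\nTTTT\n"
    ">seq3 same as seq1\nacgtacgt\n"
)
print(find_duplicate_groups(sample))  # [['seq1', 'seq3']]
```

Grouping by sequence rather than by ID is what catches entries that were deposited under different accessions but carry the same residues.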
1. RemoveRep.pl
Dependencies
Bio::Perl
Bio::SeqIO
Usage
removerep.pl input.txt output.txt
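The script source itself is not reproduced in this text, so the following is only a minimal Python sketch of the general idea, not the author's Perl code: read a multi-FASTA file, keep the first record for each distinct sequence, and write the survivors to a new file. The names `read_fasta`, `remove_duplicates` and `dedupe_file` are hypothetical, and the sketch assumes that "duplicate" means an exact, case-insensitive match of the full sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def dedupe_file(in_path, out_path):
    """Read in_path, drop repeated sequences, write the rest to out_path."""
    with open(in_path) as handle:
        unique = remove_duplicates(read_fasta(handle))
    with open(out_path, "w") as out:
        for header, seq in unique:
            # Sequences are written on a single line, without re-wrapping.
            out.write(f">{header}\n{seq}\n")
```

Calling `dedupe_file("input.txt", "output.txt")` mirrors the command line shown above. Keeping the first occurrence is an arbitrary choice here; which ID should survive depends on your data.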
2. RemoveRep2.pl
Usage
Your input sequences should be in input.txt.
3. Other options
Some other free bioinformatics programs you may want to try for removing duplicate or redundant sequences from your datasets:
I. CD-HIT
II. DNA Baser
III. Duplicates Finder
Labels:
HOW TO,
PERL,
Perl Script,
Sequence analysis
Thanks for this, it worked perfectly to remove duplicated sequences from a combined NCBI reference database.
The link to download is not working.
Sorry for the inconvenience. You can save the source code directly from the post. Thanks.