Perl Script 3: How to remove non-amino-acid characters from a multi-FASTA file

                                                     
Question: I have been trying to remove non-sequence data from a FASTA file of protein sequences. My input file is shown below. I am trying to remove the digits and white space, as I only need the accession number and the sequence:
>P67870
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDEELEDNPNQSDLIEQAAEMLYGLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPMLPIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTGFPHMLFMVHPEYRPKRPANQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR 209 S 7578274 CDK LTP 2004-12-31 00:00:00+01

>Q99640
MLERPPALAMPMPTEGTPPPLSGTPIPVPAYFRHAEPGFSLKRPRGLSRSLPPPPPAKGSIPISRLFPPRTPGWHQLQPRRVSFRGEASETLQSPGYDPSRPESFFQQSFQRLSRLGHGSYGEVFKVRSKEDGRLYAVKRSMSPFRGPKDRARKLAEVGSHEKVGQHPCCVRLEQAWEEGGILYLQTELCGPSLQQHCEAWGASLPEAQVWGYLRDTLLALAHLHSQGLVHLDVKPANIFLGPRGRCKLGDFGLLVELGTAGAGEVQEGDPRYMAPELLQGSYGTAADVFSLGLTILEVACNMELPHGGEGWQQLRQGYLPPEFTAGLSSELRSVLVMMLEPDPKLRATAEALLALPVLRQPRAWGVLWCMAAEALSRGWALWQALLALLCWLWHGLAHPASWLQPLGPPATPPGSPPCSLLLDSSLSSNWDDDSLGPSLSPEAVLARTVGSTSTPRSRCTPRDALDLSDINSEPPRGSFPSFEPRNLLSLFEDTLDPT 426 S 12738781 PLK1 LTP 2004-12-31 00:00:00+01

>P10747
MLRLLLALNLFPSIQVTGNKILVKQSPMLVAYDNAVNLSCKYSYNLFSREFRASLHKGLDSAVEVCVVYGNYSQQLQVYSKTGFNCDGKLGNESVTFYLQNLYVNQTDIYFCKIEVMYPPPYLDNEKSNGTIIHVKGKHLCPSPLFPGPSKPFWVLVVVGGVLACYSLLVTVAFIIFWVRSKRSRLLHSDYMNMTPRRPGPTRKHYQPYAPPRDFAAYRS 191 Y 8992971 Lck;ITK LTP 2004-12-31 00:00:00+01

Answer: Assuming that the file containing your FASTA sequences is named "input.txt" and that you want to save the result in "output.txt", you can use either of these Perl scripts. Save each script in its own file and then run it on your machine.
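Perl Script 1:

#!/usr/bin/perl -w
# Writes the result as multi-line FASTA (the line layout of the input is kept)
use strict;

open (IN, "input.txt") or die "Can't open input.txt: $!\n";
open (OUT, ">output.txt") or die "Can't open output.txt: $!\n";

# Read the input one line at a time into $_
while (<IN>) {
    # Delete everything from the first run of whitespace followed by digits to the end of the line
    s/\s+\d+.+\n/\n/;
    print OUT;
}

close (IN);
close (OUT) or die "Can't close output.txt: $!\n";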
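
To see what the substitution in Script 1 does without touching any files, here is a minimal sketch that applies the same regular expression to a few sample lines held in memory. The helper name strip_annotation and the shortened sequence are assumptions made for illustration; they are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): applies the same
# substitution as Script 1 to one line and returns the result.
sub strip_annotation {
    my ($line) = @_;
    # Remove everything from the first whitespace-then-digits run up to
    # and including the newline, then put a single newline back.
    $line =~ s/\s+\d+.+\n/\n/;
    return $line;
}

# Shortened sample lines modelled on the input in the question.
my @sample = (
    ">P67870\n",
    "MSSSEEVSWISW 209 S 7578274 CDK LTP 2004-12-31 00:00:00+01\n",
    "\n",
);

for my $line (@sample) {
    my $cleaned = strip_annotation($line);
    print $cleaned;
}
```

The header and the blank line pass through unchanged because neither contains whitespace followed by a digit. Only the sequence line loses its trailing annotation.
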
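Perl Script 2:

#!/usr/bin/perl -w
# Writes the result as single-line FASTA (each sequence is joined onto one line)
use strict;

open (IN, "input.txt") or die "Can't open input.txt: $!\n";
open (OUT, ">output.txt") or die "Can't open output.txt: $!\n";

while (<IN>) {
    # Skip annotation-only lines such as "CDK LTP 2004-12-31 ...". This test runs first
    # because the substitution below would strip the date that the pattern looks for.
    next if /^(\w{3}\s){2}([\d:+-])+/;
    # Put each header on a line of its own
    s/^>(.*)$/\n>$1\n/;
    # Delete the trailing annotation
    s/\s+\d+.+\n/\n/;
    # Drop the newline so that wrapped sequence lines are joined
    chomp;
    print OUT;
}
print OUT "\n";   # end the last record with a newline

close (IN);
close (OUT) or die "Can't close output.txt: $!\n";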
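
The least obvious part of Script 2 is the pattern /^(\w{3}\s){2}([\d:+-])+/, which appears intended to recognise lines that hold only annotation (two three-character words followed by a date). The following sketch tests that pattern on its own against a few raw lines. The helper name is_annotation_line and the sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from Script 2, stored in a variable for reuse.
my $annotation_re = qr/^(\w{3}\s){2}([\d:+-])+/;

# Hypothetical helper (an assumption for this sketch): returns 1 if the
# pattern matches the line, 0 otherwise.
sub is_annotation_line {
    my ($line) = @_;
    return $line =~ $annotation_re ? 1 : 0;
}

my @samples = (
    "CDK LTP 2004-12-31 00:00:00+01",    # two 3-character words, then a date
    "PLK1 LTP 2004-12-31 00:00:00+01",   # first word has 4 characters
    "MSSSEEVSWISWFCGLRGNEFF",            # sequence residues
    ">P67870",                           # header line
);

for my $line (@samples) {
    printf "%-34s %s\n", $line, is_annotation_line($line) ? "skipped" : "kept";
}
```

Only the first sample is reported as skipped. The "PLK1" line is kept because \w{3}\s requires exactly three word characters before the space, so annotation that starts with a longer name (such as PLK1 or Lck;ITK in the question's data) would not be filtered by this pattern if it ever ended up on a line of its own.
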

Source : protocol-online
