Extracting FASTA Sequences Based on Position WITH PERL Script
|
Previously, I have shared two different methods to extract fasta sequences based on position. For example, if you have coordinates for the nucleotide or amino acid positions and you want obtained the nucleotide or amino acid sequence from that position then you can easily use these methods. I generally predict the domain boundry from several protein sequences from NCBI's CDD database and then use these methods to isolate the amino acid sequence from those domain from several sequences.
Extract Part of a FASTA Sequences with Position by python script HERE
Extract Part of a FASTA Sequences with Position online HERE
Here I am going to share some PERL scrip that can extract the nucleotide or amino acid sequences for the user defined positions.
Extract Part of a FASTA Sequences with Position online HERE
PERL Script : sub-seq.pl
#!/usr/bin/env perl #Uses: perl sub-seq.pl input.txt range use strict; use warnings; my $end = pop; my $start = pop; local $/ = '>'; while (<>) { chomp; next unless /(.+)/; my ($header) = "$/$1_$start-$end\n"; my $seq = ${^POSTMATCH}; $seq =~ s/\s//g; print $header; print +( substr $seq, $start - 1, $end ) . "\n"; }
HOW TO SAVE THE PERL SCRIPT HERE
Uses
perl sub-seq.pl input.txt 10 15Here
input.txtis the file which contains your nucleotide or amino acid sequences while
10 15
is the range of which you want to extract the sequence. So this script will count the 15 bases/amino acid from position 10 and print them in your result file.input file
>Seq1 TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC >Seq2 TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT
Result
>seq1_10-15 AGCTGAAAATATGTA >seq2_10-15 AGCTGAAAATATGTA
Related Posts HOW TO,
Perl Script,
Sequence analysis
|
Was This Post Useful? Add This To Del.icio.us Share on Facebook StumbleUpon This Add to Technorati Share on Twitter |
Labels:
HOW TO,
Perl Script,
Sequence analysis
Subscribe to:
Post Comments (Atom)
Traceback (most recent call last):
ReplyDeleteFile "domainseq.py", line 31, in
print(fasta_dict[line[0]][s:e])
KeyError: 'NW_019014662.1'
I got such problem, seem some wrong with the name, can you help me?
Hi,
Deletesorry that you got this error. Please make sure that you are giving correct range to script and sequence is in fasta format. You can use this online tools also
http://bioinfo-tool.blogspot.com/2015/08/extracting-fasta-sequences-based-on.html
or this python script
http://www.bioinformatics-made-simple.com/2013/10/actually-i-have-hundreds-of-protein.html
Hope this will be helpful.
Traceback (most recent call last):
ReplyDeleteFile "domainseq.py", line 31, in
print(fasta_dict[line[0]][s:e])
KeyError: 'NW_019014662.1'
can you help me