Extracting FASTA Sequences Based on Position WITH PERL Script


Previously, I have shared two different methods to extract fasta sequences based on position. For example, if you have coordinates for the nucleotide or amino acid positions and you want obtained the nucleotide or amino acid sequence from that position then you can easily use these methods. I generally predict the domain boundry from several protein sequences from NCBI's CDD database and then use these methods to isolate the amino acid sequence from those domain from several sequences.
Extract Part of a FASTA Sequences with Position by python script HERE
 Extract Part of a FASTA Sequences with Position online HERE
Here I am going to share some PERL scrip that can extract the nucleotide or amino acid sequences for the user defined positions.

PERL Script : sub-seq.pl

#!/usr/bin/env perl

#Uses: perl sub-seq.pl input.txt range

use strict;
use warnings;

my $end   = pop;
my $start = pop;
local $/ = '>';

while (<>) {
    chomp;
    next unless /(.+)/;
    my ($header) = "$/$1_$start-$end\n";
    my $seq = ${^POSTMATCH};
    $seq =~ s/\s//g;
    print $header;
    print +( substr $seq, $start - 1, $end ) . "\n";
}

HOW TO SAVE THE PERL SCRIPT HERE

Uses

perl sub-seq.pl input.txt 10 15
Here
input.txt
is the file which contains your nucleotide or amino acid sequences while
10 15
is the range of which you want to extract the sequence. So this script will count the 15 bases/amino acid from position 10 and print them in your result file.

input file

>Seq1
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
>Seq2
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT

Result

>seq1_10-15
AGCTGAAAATATGTA
>seq2_10-15
AGCTGAAAATATGTA


3 comments:

  1. Traceback (most recent call last):
    File "domainseq.py", line 31, in
    print(fasta_dict[line[0]][s:e])
    KeyError: 'NW_019014662.1'

    I got such problem, seem some wrong with the name, can you help me?

    ReplyDelete
    Replies
    1. Hi,
      sorry that you got this error. Please make sure that you are giving correct range to script and sequence is in fasta format. You can use this online tools also
      http://bioinfo-tool.blogspot.com/2015/08/extracting-fasta-sequences-based-on.html

      or this python script
      http://www.bioinformatics-made-simple.com/2013/10/actually-i-have-hundreds-of-protein.html


      Hope this will be helpful.

      Delete
  2. Traceback (most recent call last):
    File "domainseq.py", line 31, in
    print(fasta_dict[line[0]][s:e])
    KeyError: 'NW_019014662.1'

    can you help me

    ReplyDelete

Have Problem ?? Drop a comments here!