Extracting FASTA Sequences Based on Position WITH PERL Script~ Bioinformatics Made Simple.com

Extracting FASTA Sequences Based on Position WITH PERL Script

Previously, I have shared two different methods to extract fasta sequences based on position. For example, if you have coordinates for the nucleotide or amino acid positions and you want obtained the nucleotide or amino acid sequence from that position then you can easily use these methods. I generally predict the domain boundry from several protein sequences from NCBI's CDD database and then use these methods to isolate the amino acid sequence from those domain from several sequences.

Extract Part of a FASTA Sequences with Position by python script HERE
Extract Part of a FASTA Sequences with Position online HERE

Here I am going to share some PERL scrip that can extract the nucleotide or amino acid sequences for the user defined positions.

PERL Script : sub-seq.pl

#!/usr/bin/env perl

#Uses: perl sub-seq.pl input.txt range

use strict;
use warnings;

my $end   = pop;
my $start = pop;
local $/ = '>';

while (<>) {
    chomp;
    next unless /(.+)/;
    my ($header) = "$/$1_$start-$end\n";
    my $seq = ${^POSTMATCH};
    $seq =~ s/\s//g;
    print $header;
    print +( substr $seq, $start - 1, $end ) . "\n";
}

HOW TO SAVE THE PERL SCRIPT HERE

Uses

perl sub-seq.pl input.txt 10 15

Here

input.txt

is the file which contains your nucleotide or amino acid sequences while

10 15

is the range of which you want to extract the sequence. So this script will count the 15 bases/amino acid from position 10 and print them in your result file.

input file

>Seq1
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
>Seq2
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT

Result

>seq1_10-15
AGCTGAAAATATGTA
>seq2_10-15
AGCTGAAAATATGTA

3 comments:

UnknownMarch 8, 2018 at 9:45 AM
Traceback (most recent call last):
File "domainseq.py", line 31, in
print(fasta_dict[line[0]][s:e])
KeyError: 'NW_019014662.1'

I got such problem, seem some wrong with the name, can you help me?
ReplyDelete
Replies
UnknownMarch 8, 2018 at 9:45 AM
Traceback (most recent call last):
File "domainseq.py", line 31, in
print(fasta_dict[line[0]][s:e])
KeyError: 'NW_019014662.1'

can you help me
ReplyDelete
Replies

Add comment

Have Problem ?? Drop a comments here!

Pages