Extract Part of a FASTA Sequences with Position~ Bioinformatics Made Simple.com

Update: 5.29.2018
Bedtools is another nice tool for extracting the sequence of defined regions from a FASTA file. Install Bedtools on your Ubuntu machine using these commands

sudo apt-get update
sudo apt-get install bedtools

and extract the sequences with this command

bedtools getfasta -fi input_fasta -bed id_file

The formats of the FASTA file and the ID file are the same as described below.
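One thing to watch with Bedtools: it reads the ID file as BED, so the three columns must be separated by tabs, with a 0-based start and an exclusive end. If your ID file uses spaces, something like the following Python sketch could rewrite it as tab-delimited lines. The function name ids_to_bed is hypothetical (it is not part of Bedtools or domainseq.py), and the coordinates are passed through unchanged on the assumption that they already follow the BED convention.

```python
def ids_to_bed(lines):
    """Yield tab-delimited 'id<TAB>start<TAB>end' lines from whitespace-separated rows."""
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"expected 'id start end', got: {raw!r}")
        # int() checks that the coordinates are numeric; values are not shifted
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        yield f"{seq_id}\t{start}\t{end}"


# Small demonstration using the two example rows from this post
demo = ["AT1G01250   45  102", "", "AT1G03800   65  109"]
print("\n".join(ids_to_bed(demo)))
```

If your coordinates are 1-based and inclusive instead, subtract 1 from the start before writing the file.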
=======================================================================

I have hundreds of protein sequences and have identified the conserved domain in each of them. I now have the locations of all those domains and want to extract the exact sequence at each location. This is easy for a single protein with one or more domain locations, but it is very difficult to do by hand for many protein sequences and their domain coordinates. I found an easy Python script for extracting FASTA sequences based on position. I have also shared an online program originally written by Dr Pierre Lindenbaum HERE.

Example FASTA file with protein sequences


>AT1G01250 
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL

>AT1G03800 
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP

Example ID file with domain location

AT1G01250   45  102
AT1G03800   65  109

Script name: domainseq.py (Download)

Usage

python domainseq.py input.fasta ids.txt > result.fasta

Results

>AT1G01250:45-102
IREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPR
>AT1G03800:65-109
AEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP
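
For readers who want to see the idea in code, here is a minimal Python sketch of position-based extraction. It is not the original domainseq.py; the function names read_fasta and extract_regions are hypothetical. It assumes Python slice semantics (0-based start, exclusive end), which is consistent with the AT1G03800 result shown above, and it takes the sequence ID to be the header text up to the first whitespace. Blank or incomplete ID lines and IDs missing from the FASTA file are skipped instead of raising IndexError or KeyError.

```python
import sys


def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}; id = header text up to first whitespace."""
    chunks, current = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def extract_regions(seqs, id_lines):
    """Yield (label, subsequence) for each 'id start end' row, using seq[start:end]."""
    for raw in id_lines:
        fields = raw.split()
        if len(fields) < 3:
            continue  # blank or incomplete row
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        if seq_id not in seqs:
            print(f"warning: {seq_id} not found in FASTA", file=sys.stderr)
            continue
        yield f"{seq_id}:{start}-{end}", seqs[seq_id][start:end]


# Toy data (hypothetical), just to show the call pattern
fasta = [">demo1 toy protein", "MKTAY", "IAKQR"]
ids = ["demo1 2 5"]
for label, subseq in extract_regions(read_fasta(fasta), ids):
    print(f">{label}")
    print(subseq)  # prints TAY
```

If your coordinates are 1-based and inclusive, use seqs[seq_id][start - 1:end] instead.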

16 comments:

utpalmtbi, February 17, 2014 at 3:28 PM
File "domainseq.py", line 27, in <module>
outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range
Anonymous, November 20, 2014 at 9:39 PM
How can this error be solved?
Traceback (most recent call last):
File "domainseq.py", line 31, in <module>
print(fasta_dict[line[0]][s:e])
KeyError: 'MyfirstID'

Unknown, August 8, 2015 at 1:04 PM
I think line 29 and line 30 should be:
s= int(line[1])-1
e= int(line[2])-1
Unknown, March 9, 2018 at 3:23 AM
Hi, my sequence names look like "NW_019011177.1" and your command does not work with them. Can you help me modify it?