Extract Part of a FASTA Sequence by Position
Update: 5.29.2018
Bedtools is another nice tool for extracting defined regions from a FASTA file. Install Bedtools on your Ubuntu machine using these commands:
sudo apt-get update
sudo apt-get install bedtools
and extract the sequences with this command:
bedtools getfasta -fi input_fasta -bed id_file
The formats of the FASTA file and the ID file are the same as described below.
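Bedtools reads intervals in BED coordinates, where the start is 0-based and the end is exclusive. That is the same convention as a Python slice, so it is worth checking which residues a position pair actually selects. A minimal sketch of the convention; the helper names `bed_slice` and `one_based_to_bed` are made up for illustration and are not part of Bedtools:

```python
def bed_slice(sequence, start, end):
    """Return the residues covered by a BED interval (0-based start, exclusive end)."""
    return sequence[start:end]


def one_based_to_bed(first, last):
    """Convert a 1-based, inclusive residue range to BED start and end."""
    return first - 1, last


seq = "MSPQRMKLSS"
# BED interval 5-10 covers residues 6 through 10 when counting from 1.
print(bed_slice(seq, 5, 10))
# Residues 6 through 10 (1-based, inclusive) correspond to the BED interval 5-10.
print(one_based_to_bed(6, 10))
```

If your domain coordinates count residues from 1 and include the last residue, subtract 1 from the start before writing the ID file for Bedtools.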
=======================================================================
I have hundreds of protein sequences, and I have identified the conserved domain in each of them. Now that I have the locations of all those domains, I want to extract the exact sequence at each location. This is easy with a single sequence and the location of one or more domains in it, but it is very difficult to pull the domain sequences out of many protein sequences by hand using the domain coordinates. I found an easy Python script for extracting FASTA sequences based on position. I have also shared an online program originally written by Dr Pierre Lindenbaum HERE.
Example FASTA file with protein sequences
>AT1G01250
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL
>AT1G03800
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP
Example ID file with domain locations (name, start, and end separated by tabs)
AT1G01250 45 102
AT1G03800 65 109
Script name | Download |
---|---|
domainseq.py | |
Usage
python domainseq.py input.fasta ids.txt > result.fasta
Results
>AT1G01250:45-102
IREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPR
>AT1G03800:65-109
AEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP
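For readers who want to see the idea behind this kind of extraction, here is a minimal sketch. It is not the downloadable domainseq.py. The function names `read_fasta` and `extract_regions` and the toy data are hypothetical. It assumes the positions are used as a Python slice (0-based start, exclusive end), which reproduces the AT1G03800 result above. It also splits the ID lines on any whitespace, so tabs and spaces both work, and it reports malformed lines and unknown IDs explicitly.

```python
def read_fasta(lines):
    """Build a dict mapping each sequence ID (first word after '>') to its sequence."""
    fasta_dict = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            fasta_dict[name] = ""
        elif name is not None:
            fasta_dict[name] += line  # join multi-line sequences
    return fasta_dict


def extract_regions(fasta_dict, id_lines):
    """Yield (header, subsequence) for every 'ID start end' line."""
    for number, raw in enumerate(id_lines, start=1):
        fields = raw.split()  # splits on tabs or spaces
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"ID file line {number}: expected 'ID start end', got {raw!r}")
        name, s, e = fields[0], int(fields[1]), int(fields[2])
        if name not in fasta_dict:
            raise KeyError(f"ID file line {number}: {name!r} is not a FASTA header")
        # Assumption: Python-style slice (0-based start, exclusive end).
        yield f"{name}:{s}-{e}", fasta_dict[name][s:e]


# Toy data, made up for this sketch.
fasta_text = """>seq1 toy example
MSPQRMKLSS
PPVTNNEPTA
"""
id_text = "seq1\t5\t10\nseq1 10 14\n"

records = read_fasta(fasta_text.splitlines())
for header, subseq in extract_regions(records, id_text.splitlines()):
    print(">" + header)
    print(subseq)
```

Only the first word of each FASTA header is used as the ID, so the names in the ID file must match that word exactly.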
File "domainseq.py", line 27, in
outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range
Hi, sorry you ran into a problem extracting a subsequence with this Python script. You may use this method instead: Extract Part of a FASTA Sequence by Position
Use tabs instead of spaces to separate the name and the positions in the ids.txt file.
Thanks, Andre.
Hi,
I have exactly the same problem, and I took care to use only tabs as separators. Any suggestions?
Thank you very much,
Julien
Hi Julien,
Could you share part of your data?
How can this error be solved?
Traceback (most recent call last):
File "domainseq.py", line 31, in
print (fasta_dict[line[0]][s:e])
KeyError: 'MyfirstID'
Hi BuckeyePuzzler,
Sorry for the trouble. Your name and positions should be separated by a tab instead of a space. You can download the script again. I have attached an example file for both the IDs and the sequences. Hope this helps.
I think line 29 and line 30 should be:
s= int(line[1])-1
e= int(line[2])-1
I have checked this script. It is working. Thanks
Hi, my sequence name looks like "NW_019011177.1", and the script does not work when I use your command. Can you help me modify it?
Hi, please remember that your ID file should be tab-delimited. You can also send me sample data. Alternatively, you may use this method: Extract Part of a FASTA Sequence by Position