Extract Part of a FASTA Sequences with Position


Update: 5.29.2018
Bedtools is another handy tool for extracting defined regions from a FASTA file. Install it on an Ubuntu machine with these commands:
sudo apt-get update
sudo apt-get install bedtools
Then extract the sequences with this command:
bedtools getfasta -fi input_fasta -bed id_file
The FASTA file and the ID file use the same formats as described below. bedtools reads the ID file as a BED file, so its columns must be tab-separated.
=======================================================================

I have hundreds of protein sequences, and I have identified the conserved domain in each of them. Now that I have the coordinates of all those domains, I want to extract the exact sequence at each location. That is easy for a single protein with one or more domains, but it is tedious to pull out the domain sequences from many proteins by hand using their coordinates. I found an easy Python script for extracting FASTA sequences based on position. I have also shared an online program originally written by Dr Pierre Lindenbaum.

Example FASTA file with protein sequences

>AT1G01250 
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL

>AT1G03800 
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP

Example ID file with domain locations (columns: ID, start, end; tab-separated)

AT1G01250   45  102
AT1G03800   65  109
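
Several of the errors reported in the comments below are traced to ID files that use spaces instead of tabs. The following is a minimal sketch of a more forgiving parser, not part of domainseq.py: the function names `parse_id_line` and `parse_id_lines` are hypothetical, and the sketch assumes that sequence IDs contain no whitespace and that start and end are non-negative integers with start smaller than end.

```python
def parse_id_line(line):
    """Split one ID-file line into (sequence ID, start, end).

    str.split() with no argument splits on any run of whitespace, so
    tab-separated and space-separated lines are both accepted.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 columns (ID, start, end), found {len(fields)}: {line!r}"
        )
    seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end <= start:
        raise ValueError(f"invalid range {start}-{end} for {seq_id!r}")
    return seq_id, start, end


def parse_id_lines(lines):
    """Parse every non-blank line of an ID file (assumed layout: ID, start, end)."""
    return [parse_id_line(line) for line in lines if line.strip()]


# One tab-separated line, one space-separated line, one blank line.
example = ["AT1G01250\t45\t102\n", "AT1G03800   65  109\n", "\n"]
for region in parse_id_lines(example):
    print(region)
```

A malformed line raises a ValueError that quotes the offending line, which is usually easier to act on than a bare "list index out of range".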


Script name: domainseq.py

Usage
python domainseq.py input.fasta ids.txt > result.fasta
Results
>AT1G01250:45-102
HGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPE
>AT1G03800:65-109
AEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP
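
The source of domainseq.py is not reproduced in this post, so the following is only a sketch of position-based extraction under stated assumptions, not the original script. The function names `read_fasta` and `extract_regions` are hypothetical. The sketch assumes that the sequence ID is the first word after ">" in the header, and that start and end are used as a Python slice (start counted from 0, end excluded), which is consistent with the AT1G03800 result above and with the BED convention used by bedtools.

```python
def read_fasta(lines):
    """Return {sequence ID: sequence}; the ID is the first word after '>'."""
    records = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


def extract_regions(sequences, regions):
    """Yield (header, subsequence) for each (ID, start, end) region."""
    for seq_id, start, end in regions:
        if seq_id not in sequences:
            raise KeyError(f"ID {seq_id!r} from the ID file is not in the FASTA file")
        yield f">{seq_id}:{start}-{end}", sequences[seq_id][start:end]


# Toy data instead of the protein sequences above, to keep the example short.
fasta_lines = [">seq1 demo", "ABCDEFGHIJ", "KLMNOPQRST", "", ">seq2", "ACGTACGT"]
sequences = read_fasta(fasta_lines)
for header, subseq in extract_regions(sequences, [("seq1", 2, 5), ("seq2", 0, 4)]):
    print(header)
    print(subseq)
# prints:
# >seq1:2-5
# CDE
# >seq2:0-4
# ACGT
```

If the domain coordinates come from a tool that counts from 1 and includes the end position, subtract 1 from start and leave end unchanged before slicing.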
















  • 16 comments:

    1. File "domainseq.py", line 27, in <module>
      outname= line[0] + ':' + line[1] + '-' + line[2]
      IndexError: list index out of range

      Replies
      1. Hi, sorry you ran into a problem extracting subsequences with this Python script. You may use this method instead: Extract Part of a FASTA Sequences with Position

      2. Use tabs instead of spaces to separate the name and the positions on the ids.txt file.

      3. Hi,

        I have exactly the same problem and I take care to use only tabs as separator. Any suggestions?
        Thank you very much,

        Julien

      4. Hi Julien,
        Is it possible for you to share the part of your data?

    2. How can this error be solved?
      Traceback (most recent call last):
      File "domainseq.py", line 31, in <module>
      print(fasta_dict[line[0]][s:e])
      KeyError: 'MyfirstID'


      Replies
      1. Hi BuckeyePuzzler,
        Sorry for the trouble. The name and positions should be separated by a tab instead of spaces. You can download the script again; I have attached example files for both the IDs and the sequences. Hope this helps.

    3. I think line 29 and line 30 should be:
      s= int(line[1])-1
      e= int(line[2])-1

      Replies
      1. I have checked this script. It is working. Thanks

    4. Hi my sequence name is like this "NW_019011177.1", can not work when I use your command, can you help me to modify it?

      Replies
      1. Hi, please remember that your ID file should be tab-delimited. You can also send me sample data. Alternatively, you may use this method: Extract Part of a FASTA Sequences with Position

