How to Extract Multiple Sequences from a Multi-FASTA File with Perl - II

Previously I shared a Perl script to extract multiple sequences from a multi-FASTA file. If you have accession numbers stored in one file and sequences in another, you can fetch the sequences with that script. Here the situation is different. We have FASTA sequences in one file (sequence.txt) and accession numbers/IDs in another file (ID.txt), but the IDs are grouped in rows. We want to extract the FASTA sequences according to these rows and store each group in its own file (out_1, out_2, out_3): the sequences for the IDs in row 1 go to out_1, those for row 2 go to out_2, and so on.

SCRIPT 1: extract-seq.pl

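#!/usr/bin/perl
use strict;
use warnings;

# %list maps each ID to its output file name; %FHs caches open file handles.
my ( %list, %FHs, $id );

# First file (IDs): every ID on row N is assigned to the file out_N.
while (<>) {
    $list{$_} = "out_$." for split;
    last if eof;    # stop at the end of the ID file, before the FASTA file
}

# Second file (FASTA): read one record at a time, using '>' as the separator.
local $/ = '>';
while (<>) {
    chomp;          # remove the trailing '>' that starts the next record
    # The ID is the whole header line and must match a listed ID exactly.
    if ( ($id) = /(.+)/ and exists $list{$id} ) {
        # Open each output file only once, on first use.
        unless ( defined $FHs{ $list{$id} } ) {
            open $FHs{ $list{$id} }, '>', $list{$id}
                or die "Cannot open $list{$id}: $!";
        }
        print { $FHs{ $list{$id} } } ">$_";
    }
}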
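The two passes of the script can also be sketched as subroutines that work on in-memory text, which makes the logic easier to follow and to test. This is only an illustration, not part of the original script: the names build_id_map and group_records are hypothetical, and records are split on a '>' at the start of a line rather than by setting $/.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map every ID to "out_N", where N is the row it is on.
sub build_id_map {
    my ($id_text) = @_;
    my ( %map, $row );
    for my $line ( split /\n/, $id_text ) {
        $row++;
        $map{$_} = "out_$row" for split ' ', $line;
    }
    return %map;
}

# Hypothetical helper: collect FASTA records under the output name of their ID.
sub group_records {
    my ( $fasta_text, %map ) = @_;
    my %out;
    for my $rec ( split /^>/m, $fasta_text ) {
        # The ID is the whole first line of the record; skip empty records.
        my ($id) = $rec =~ /(.+)/ or next;
        next unless exists $map{$id};
        $out{ $map{$id} } .= ">$rec";
    }
    return %out;
}
```

Calling build_id_map on the contents of the ID file and passing the result to group_records together with the FASTA text should give one string per output name, which can then be written to files.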

Usage

perl extract-seq.pl ID.txt sequence.txt

Input

Sequences

>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT
>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT

IDs

Seq1 Seq2 Seq3
Seq4 Seq5
Seq6 

Results

out_1 (IDs from row 1: Seq1 Seq2 Seq3)

>Seq1
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGC
CAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAAC
ACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCC
AGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGC
ATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTG
AAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCA
AGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCT
TCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGG
GGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq2
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq3
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT

out_2 (IDs from row 2: Seq4 Seq5)

>Seq4
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT
>Seq5
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATAT

out_3 (IDs from row 3: Seq6)

>Seq6
TGCTCCCGGCCGCACTGGCGGCCGCGGGAATTCGATTCGACAAGGCGTTGGTGCTGCCCACAAAGGCCAGTTCGATATCGCGCTCGCGGGTTTGCAACTGCAACAGGCTCTGGTCATAGCCTTTGGGCACGAACACCGCATCAAAGCCTTCTTCGCGCAGGCGTTCGCTGACCATGAAGCCCGAACAGATCACCCGCGCCCAGGGCAGCTTGCGATAGTGCGCGCTGAACTTGCCGGTGTACTTGCAGGGGATGTAGTTCTGGTAGGCATCGTGTTCAAGGATGACCAGATTGGGAATCGTGCGGATGAACCCGACCTGACGGACTTCCTGCTTGAAGCGCAGAAAAAACACGATCCGGTCATAACGCTCGACATCCACTTCACGGCGGAAATAGCCACGCAAGGTTGCGCTGCTCATCGGAGCTCCAGCCCAACCGCACCTCGCACTCACAATACGCGGCGATGCCCTTCATAAAGACGGTCGAGAATGGCCCGCTGCTCTTTCTTGGACCCCAGAAGTANAAACCTTTTCATGGGGTNTTCCCTTGCCAGTTACCTGCGCCCCTGCCTGAAATCACGATATT

Advantages


  • Writes every record whose header matches a listed ID, including records with redundant (repeated) FASTA headers
  • Matching is case sensitive, and the ID must equal the whole FASTA header line
  • Works with both multi-line and single-line FASTA
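Because the script compares the ID with the whole header line, a header such as ">seq1" or ">Seq1 some description" is not matched by the ID "Seq1". If that is too strict for your data, one possible relaxation is to reduce both the IDs and the headers to a common key before comparing them. The sketch below is an assumption, not part of the original script; the name normalise_id is hypothetical. It keeps only the first whitespace-delimited token and lowercases it.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a header line or an ID to a comparable key.
# An optional leading '>' is dropped, only the first token is kept,
# and the result is lowercased so that the comparison ignores case.
sub normalise_id {
    my ($text) = @_;
    my ($token) = $text =~ /^>?\s*([^\s>]+)/ or return;
    return lc $token;
}
```

With such a helper, the keys stored in the ID hash and the ID taken from each header would both be passed through normalise_id before the lookup. Note that lowercasing can merge IDs that differ only by case, so it is only suitable when that cannot happen in your data.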