perl - How to find multiple motifs (substrings) in a protein sequence (string)

Published: 2020-12-15 23:36:16 | Category: Big Data | Source: compiled from the web


The following script is used to find a single motif in a protein sequence.

use strict;
use warnings;

my @file_data=();
my $protein_seq='';
my $h= '[VLIM]';
my $s= '[AG]';
my $x= '[ARNDCEQGHILKMFPSTWYV]';   # any of the 20 standard amino acids
my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
my @locations=();

@file_data= get_file_data("seq.txt");

$protein_seq= extract_sequence(@file_data);

#searching for the motif hhhhDxxxxD in each line of the given file

foreach my $line(@file_data){
    if ($line=~ /$regexp/){
        print "found motif \n\n";
      } else {
        print "not found \n\n";
    }
}
#recording the location/position of the motif to be output

@locations= match_positions($regexp,$protein_seq);
if (@locations){
    print "Searching for motifs $regexp \n";
    print "Catalytic site is at location:\n";
    print "@locations\n";   # 0-based start offset of each match
  } else {
    print "motif not found \n\n";
}
exit;

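sub get_file_data{
    my ($filename)=@_;
    # read every line of the file into a list
    open(my $fh, '<', $filename) or die "Cannot open file \"$filename\": $!\n";
    my @lines=<$fh>;
    close $fh;
    return @lines;
}

sub extract_sequence{
    my (@fasta_file_data)=@_;
    my $sequence='';

    foreach my $line(@fasta_file_data){
        # skip blank lines, comment lines and FASTA header lines
        if ($line=~ /^\s*(#.*)?$|^>/){
            next;
          }
        else {
            $sequence.=$line;
        }
    }
    $sequence=~ s/\s//g;   # remove all whitespace, including newlines
    return $sequence;
}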

sub match_positions {
    my ($regexp,$sequence)=@_;
    my @position=();
    # scan the whole sequence; $-[0] is the 0-based start offset of the current match
    while ($sequence=~ /$regexp/ig){
        push (@position,$-[0]);
    }
    return @position;
}

I don't know how to extend this to find multiple motifs (in a fixed order, i.e. motif1, motif2, motif3) in a given file containing protein sequences.

Solution

You can simply use an alternation of the motifs (separated by |). That way the regex engine will match whichever of the alternatives it can.

/($h{4}D$x{4}D|$x{1,4}A{1,2}$s{2})/

You can then test the match by inspecting $1, which holds the text of whichever alternative matched.
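As a rough illustration of the alternation idea, here is a minimal sketch, not a definitive implementation. It assumes the two alternatives from the regex above stand for motif 1 and motif 2, and it gives each alternative its own capture group so that the defined group identifies which motif matched. The names `find_motifs`, `motifs_in_order` and `$n_motifs`, as well as the test sequence, are made up for this example and do not come from the original post.

```perl
use strict;
use warnings;

# Character classes taken from the question's script.
my $h = '[VLIM]';
my $s = '[AG]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';

# Assumption: the answer's two alternatives are motif 1 and motif 2.
# Each alternative sits in its own capture group, so exactly one of
# $1 / $2 is defined after a successful match.
my $n_motifs = 2;
my $motif_re = qr/((?:$h){4}D(?:$x){4}D)|((?:$x){1,4}A{1,2}(?:$s){2})/;

# Return one [motif number, 0-based start offset, matched text] per hit.
sub find_motifs {
    my ($sequence) = @_;
    my @hits;
    while ($sequence =~ /$motif_re/g) {
        my ($which, $text) = defined $1 ? (1, $1) : (2, $2);
        push @hits, [ $which, $-[0], $text ];
    }
    return @hits;
}

# True if motif 1, motif 2, ... each occur in that order among the hits
# (other hits may lie in between).
sub motifs_in_order {
    my @hits = @_;
    my $want = 1;
    for my $hit (@hits) {
        $want++ if $hit->[0] == $want;
    }
    return $want > $n_motifs;
}

# Hypothetical test sequence, made up for illustration only.
my $seq  = 'KKVLIMDAAAADKKKRAAGG';
my @hits = find_motifs($seq);
printf "motif %d at position %d: %s\n", @$_ for @hits;
print motifs_in_order(@hits) ? "motifs found in order\n" : "motifs not in order\n";
```

For the made-up sequence this reports motif 1 at position 2 (VLIMDAAAAD) and motif 2 at position 12 (KKKRAAGG), followed by "motifs found in order". Note that a /g scan only returns non-overlapping matches, so an occurrence that overlaps an earlier hit is skipped. A third motif could be handled by adding a third capture group and setting `$n_motifs` to 3.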
