Perl: Removing duplicates from a large dataset

Published: 2020-12-15 22:06:56 | Category: Big Data | Source: compiled from the web

I'm using Perl to generate a list of unique exons (the units that make up genes).

I've generated a file in this format (hundreds of thousands of lines):

chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4

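Position 1 is the chromosome, position 2 is the start coordinate of the exon, position 3 is the end coordinate of the exon, and position 4 is the gene name.

Because genes are often built from different arrangements of exons, the same exon can appear in more than one gene (see the first and fourth lines). I want to remove these "duplicates", i.e. delete either the gene1 line or the gene4 line (it doesn't matter which one is removed).

I've been banging my head against the wall for hours trying to do what (I think) is a simple task. Could someone point me in the right direction? I know people often use hashes to remove duplicate elements, but these aren't exact duplicates (because the gene names differ). It's important that I don't lose the gene name as well. Otherwise this would be simpler.

Here is the completely non-functional loop I tried. The "exons" array stores each line as a scalar, hence the subroutine. Don't laugh. I know it doesn't work, but at least you can see (I hope) what I'm trying to do: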

for (my $i = 0; $i < scalar @exons; $i++) {
    my @temp_line = line_splitter($exons[$i]);                      # runs subroutine turning scalar into array
    for (my $j = 0; $j < scalar @exons_dup; $j++) {
        my @inner_temp_line = line_splitter($exons_dup[$j]);        # runs subroutine turning scalar into array
        unless (($temp_line[1] == $inner_temp_line[1]) &&           # this loop ensures that the loop
                ($temp_line[3] eq $inner_temp_line[3])) {           # below skips the identical lines
            if (($temp_line[1] == $inner_temp_line[1]) &&           # if the coordinates are the same
                ($temp_line[2] == $inner_temp_line[2])) {           # between the comparisons
                splice(@exons, $i, 1);                              # delete the first one
            }
        }
    }
}

Solution

use strict;
use warnings;

my @exons = (
    'chr1 1000 2000 gene1','chr1 3000 4000 gene2','chr1 5000 6000 gene3','chr1 1000 2000 gene4'
);

# Key each exon on its coordinates; a later line with the same key overwrites the gene name.
my %unique_exons = map {
    my ($chro,$scoor,$ecoor,$gene) = (split(/\s+/,$_));
    "$chro $scoor $ecoor" => $gene
} @exons;

print "$_ $unique_exons{$_} \n" for keys %unique_exons;

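This gives you uniqueness and keeps the last gene name seen for each exon. It produces the following (hash keys are returned in no guaranteed order, so the lines may come out in a different order):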

chr1 1000 2000 gene4 
chr1 5000 6000 gene3 
chr1 3000 4000 gene2

(Editor: 李大同)
