Perl: merging 2 CSV files line by line using a primary key

Published: 2020-12-15 22:05:11 | Category: Big Data | Source: compiled from the web


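Edit: solution added.

Hi, I currently have some code that works, but it is very slow.

It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the following line: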

"one,two,four,42"

and file 2 has this line:

"one,three,42"

where, with 0-based indexing, $position = 4 holds the primary key = 42;

then the sub: merge_file($file1,$file2,$outputfile,$position);

will output a file with the line:

"one,42";

Every primary key is unique within each file, and a key may exist in one file but not in the other (and vice versa).

Each file has about 1 million lines.

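Going through each line of the first file, I use a hash to store the primary key, with the line number as the value. The line number is the index into an array, array[line number], that stores each line of the first file.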
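The indexing step described above can be sketched on a few in-memory lines. This is only an illustration: index_by_key is a hypothetical helper that does not appear in the original post, and it assumes plain comma-separated fields with no quoting.

```perl
use strict;
use warnings;

# Hypothetical helper: index CSV lines by the field at 0-based $position.
# Returns (\%line_for, \@stored), where $line_for{$key} is an index into @stored.
sub index_by_key {
    my ($lines, $position) = @_;
    my (%line_for, @stored);
    for my $line (@$lines) {
        my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
        push @stored, $line;
        $line_for{ $fields[$position] } = $#stored;
    }
    return (\%line_for, \@stored);
}

my ($line_for, $stored) = index_by_key([ 'a,b,10', 'c,d,20' ], 2);
print $stored->[ $line_for->{20} ], "\n";    # prints "c,d,20"
```

Each lookup by key is then a single hash access followed by a single array access, independent of the number of lines.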

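I then go through each line of the second file and check whether its primary key is in the hash. If it is, I fetch the line from file1array, copy the columns I need from the first array into the second, and append the result to the output. Then I delete the hash entry, and at the very end I dump the whole thing to the file. (I am using an SSD, so I want to minimise file writes.)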

It is best explained with code:

sub merge_file2 {
    my ($file1, $file2, $out, $position) = @_;
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = '';

    my %line_for;      # primary key => line number in file 1
    my @file1array;    # line number => raw line of file 1
    open my $fh1, '<', $file1 or die "Cannot open $file1: $!";
    print "$file1 opened\n";
    while (<$fh1>) {
        chomp;
        $line_for{ read_csv_string($_, $position) } = $.;   # read the key field of this csv line
        $file1array[$.] = $_;                               # store line in file1array
    }
    close $fh1;

    open my $fh2, '<', $file2 or die "Cannot open $file2: $!";
    print "$file2 opened - merging..\n";
    my @from1to2 = qw(2 4 8 17 18 19);   # which columns of file 1 are copied into file 2
    while (<$fh2>) {
        print "$.\n" if $. % 1000 == 0;
        chomp;
        my @array2 = split /,/, $_;      # split 2nd csv line by commas
        my $key = $array2[$position];
        # Only merge when file 1 has this key; a key may exist in just one file.
        if (defined $key and exists $line_for{$key}) {
            my @array1 = split /,/, $file1array[ $line_for{$key} ];
            $array2[$_] = $array1[$_] for @from1to2;
            delete $line_for{$key};      # what is left afterwards exists only in file 1
        }
        $OUTSTRING .= join(',', @array2) . "\n";
    }
    close $fh2;

    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for) {
        $OUTSTRING .= $file1array[ $line_for{$key} ] . "\n";
    }

    print "writing file $out\n\n\n";
    write_line($out, $OUTSTRING);
}
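
The sub above calls read_csv_string and write_line, which the post does not show. The following are hypothetical stand-ins, written only so the code can be tried out. They assume simple comma-separated data without quoted fields, and the author's real helpers may behave differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in: return the field at 0-based $position of a CSV line.
# Plain split on commas; quoted fields containing commas are not handled.
sub read_csv_string {
    my ($line, $position) = @_;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    return $fields[$position];
}

# Hypothetical stand-in: write a whole string to $path in a single call.
sub write_line {
    my ($path, $string) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $string;
    close $fh or die "Cannot close $path: $!";
}

print read_csv_string('a,b,42', 2), "\n";    # prints "42"
```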

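The first loop is fine, taking under 1 minute, but the second loop takes about 1 hour to run, and I wonder whether I have taken the right approach. I think it should be possible to speed this up a lot? :) Thanks in advance.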
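
One way to confirm which stage dominates the run time is to wrap each stage in a small timer. The sketch below uses the core Time::HiRes module. The helper name timed is an assumption made for illustration and is not part of the original code.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: run a code reference, report the elapsed wall-clock
# time on STDERR, and pass the code's return values through unchanged.
sub timed {
    my ($label, $code) = @_;
    my $start  = Time::HiRes::time();
    my @result = $code->();
    printf STDERR "%s: %.3f s\n", $label, Time::HiRes::time() - $start;
    return @result;
}

my ($sum) = timed('sum 1..1000', sub { my $s = 0; $s += $_ for 1 .. 1000; $s });
print "$sum\n";    # prints 500500
```

Wrapping the first and the second loop separately would show where the hour goes without guessing.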

Solution:

sub merge_file3 {
    my ($file1, $file2, $out, $position, $hsize) = @_;
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $header;

    my (@file1, @file2);
    open my $fh1, '<', $file1 or die "Cannot open $file1: $!";
    while (<$fh1>) {
        if ($. == 1) {
            $header = $_;    # keep the header row of file 1, newline included
            next;
        }
        print "$.\n" if $. % 100000 == 0;
        chomp;
        push @file1, [ split ',', $_ ];
    }
    close $fh1;

    open my $fh2, '<', $file2 or die "Cannot open $file2: $!";
    while (<$fh2>) {
        next if $. == 1;     # skip the header row of file 2
        print "$.\n" if $. % 100000 == 0;
        chomp;
        push @file2, [ split ',', $_ ];
    }
    close $fh2;

    print "sorting files\n";
    my @sortedf1 = sort { $a->[$position] <=> $b->[$position] } @file1;
    my @sortedf2 = sort { $a->[$position] <=> $b->[$position] } @file2;
    print "sorted\n";
    @file1 = ();    # "@file1 = undef" would leave a one-element list behind
    @file2 = ();

    my ($i, $j) = (0, 0);
    my $last2 = $#sortedf2;    # fixed bound, so rows appended below are not scanned again
    while ($i <= $#sortedf1 and $j <= $last2) {
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 == $key2) {
            foreach (0 .. $hsize) {    # header size
                $sortedf2[$j][$_] = $sortedf1[$i][$_]
                    if defined $sortedf1[$i][$_] and $sortedf1[$i][$_] ne '';
            }
            $i++;
            $j++;
        }
        elsif ($key1 < $key2) {    # key only in file 1: append its row
            push @sortedf2, [ @{ $sortedf1[$i] } ];
            $i++;
        }
        else {                     # key only in file 2: the row is already in place
            $j++;
        }
    }
    # File 1 rows left over once file 2 is exhausted have no match either.
    push @sortedf2, @sortedf1[ $i .. $#sortedf1 ];

    print "outputting to file\n";
    open my $out_fh, '>', $out or die "Cannot write $out: $!";
    print {$out_fh} $header;
    print {$out_fh} join(',', @$_), "\n" for @sortedf2;
    close $out_fh or die "Cannot close $out: $!";
}
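
The sort-merge idea behind this solution can be tried on a few in-memory rows. The sketch below is a simplified variant, not the author's code: merge_sorted is a hypothetical name, keys are assumed to be numeric and unique within each list, and the output comes back ordered by key instead of having the unmatched file 1 rows appended at the end.

```perl
use strict;
use warnings;

# Hypothetical helper: full outer merge of two lists of rows (array refs),
# joined on the numeric key at index $position. Non-empty fields of a left
# row overwrite the matching right row; unmatched rows from both sides are kept.
sub merge_sorted {
    my ($left, $right, $position) = @_;
    my @l = sort { $a->[$position] <=> $b->[$position] } @$left;
    my @r = sort { $a->[$position] <=> $b->[$position] } @$right;
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @l and $j < @r) {
        my $cmp = $l[$i][$position] <=> $r[$j][$position];
        if ($cmp == 0) {
            my @row = @{ $r[$j] };
            for my $col (0 .. $#{ $l[$i] }) {
                my $value = $l[$i][$col];
                $row[$col] = $value if defined $value and $value ne '';
            }
            push @out, \@row;
            $i++;
            $j++;
        }
        elsif ($cmp < 0) { push @out, [ @{ $l[$i++] } ] }    # key only on the left
        else             { push @out, [ @{ $r[$j++] } ] }    # key only on the right
    }
    # Whatever remains on either side has no partner.
    push @out, map { [ @$_ ] } @l[ $i .. $#l ], @r[ $j .. $#r ];
    return \@out;
}

my $merged = merge_sorted(
    [ [ 'one', '',      42 ], [ 'x', 'y', 7 ] ],
    [ [ '',    'three', 42 ], [ 'p', 'q', 99 ] ],
    2,
);
print join(',', @$_), "\n" for @$merged;
```

With this sample data the script prints x,y,7 then one,three,42 then p,q,99. After the two sorts, each row is visited once, so the merge step itself is linear in the total number of rows.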

Thanks everyone, the solution is posted above. Merging the whole thing now takes about 1 minute!

(Editor: 李大同)
