Very large associative arrays in Perl

Published: 2020-12-15 21:16:05 | Category: Big Data | Source: compiled from the web

I need to merge two files into a new file.

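Both have more than 300 million pipe-delimited records, with the first column as the primary key. The rows are not sorted. The second file may contain records that the first file does not have.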

Sample file 1:

1001234|X15X1211,J,S,12,15,100.05

Sample file 2:

1231112|AJ32,18,JP     
1001234|AJ15,16,PP

Output:

1001234,X15X1211,100.05,AJ15,PP

I am using the following code:

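use POSIX qw(strftime);
use Tie::File::AsHash;

# Escape the pipe so it is matched literally if the separator is used as a pattern.
tie my %hash_REP, 'Tie::File::AsHash', 'rep.in', split => qr/\|/
    or die "Cannot tie rep.in: $!";
my $counter = 0;
while (my ($key, $val) = each %hash_REP) {
    # Print a timestamp when the first record arrives.
    if ($counter == 0) {
        print strftime("%a %b %e %H:%M:%S %Y", localtime), "\n";
    }
    $counter++;
}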

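Preparing the associative array takes almost 1 hour.
Is that really good or really bad?
Is there a faster way to handle this many records in an associative array?
Any suggestion in any scripting language would be helpful.

Thanks,
Nitin T.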

I also tried the following program, which also took 1 hour:

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

print strftime("%a %b %e %H:%M:%S %Y", localtime), "\n";

# Load APP.in into an in-memory hash keyed on the first column.
my %hash;
open(my $app, '<', 'APP.in') or die "Could not open file 'APP.in' $!";
while (my $line = <$app>) {
    chomp($line);
    # The pipe must be escaped; a bare /|/ splits between every character.
    my ($key, $val) = split /\|/, $line, 2;
    $hash{$key} = $val;
}
close $app;

my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
open(my $rep, '<', 'rep.in') or die "Could not open file 'rep.in' $!";
while (my $line = <$rep>) {
    chomp($line);
    my @words = split /\|/, $line;
    # Write every field after the key, then the matching value from APP.in.
    for my $i (1 .. $#words) {
        print $fh $words[$i] . "|^";
    }
    my $app_val = $hash{$words[0]} // '';   # empty when the key is missing from APP.in
    print $fh $app_val . "\n";
}
close $rep;
close $fh;
print "done\n";

print strftime("%a %b %e %H:%M:%S %Y", localtime), "\n";

Solution

Your technique is extremely inefficient, for a couple of reasons.

>Tying is extremely slow.
>You are pulling everything into memory.

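The first can be mitigated by doing the reading and splitting yourself, but the second will always be a problem. The rule of thumb is to avoid pulling large amounts of data into memory. It will hog all the memory, may cause swapping to disk, and slow things waaaay down, especially if you are using a spinning disk.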

Instead, there are various "on-disk hashes" you can use, through modules such as GDBM_File or BerkeleyDB.
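
As a rough illustration of the on-disk hash idea, the sketch below ties a hash to a DBM file, loads one input file into it, then streams the other file and writes one merged row per key found in both. It uses SDBM_File only because that module ships with every Perl; GDBM_File and BerkeleyDB are tied in a similar way but take different flags. The helper names merge_with_disk_hash and first_and_last, the file arguments, and the rule of keeping the first and last comma-separated fields (inferred from the sample output in the question) are all assumptions, not part of the original answer.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;

# Keep the first and last comma-separated fields of a value
# (an assumption inferred from the sample output).
sub first_and_last {
    my @f = split /,/, $_[0];
    return join ',', $f[0], $f[-1];
}

# Hypothetical helper: load $lookup_file into an on-disk hash stored at
# $dbm_path, then stream $main_file and write the merged rows to $out_file.
sub merge_with_disk_hash {
    my ($main_file, $lookup_file, $out_file, $dbm_path) = @_;

    tie(my %lookup, 'SDBM_File', $dbm_path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $dbm_path: $!";

    open(my $in, '<', $lookup_file) or die "Cannot open $lookup_file: $!";
    while (my $line = <$in>) {
        $line =~ s/\s+$//;                       # strip newline and trailing blanks
        my ($key, $val) = split /\|/, $line, 2;  # escaped pipe, at most two parts
        $lookup{$key} = $val if defined $val;
    }
    close $in;

    open(my $main, '<', $main_file) or die "Cannot open $main_file: $!";
    open(my $out,  '>', $out_file)  or die "Cannot open $out_file: $!";
    while (my $line = <$main>) {
        $line =~ s/\s+$//;
        my ($key, $val) = split /\|/, $line, 2;
        # Skip malformed lines and keys that the lookup file does not have.
        next unless defined($val) && exists $lookup{$key};
        print {$out} join(',', $key, first_and_last($val),
                          first_and_last($lookup{$key})), "\n";
    }
    close $main;
    close $out;
    untie %lookup;
}
```

A call might look like merge_with_disk_hash('APP.in', 'rep.in', 'report.txt', 'lookup_db'), which leaves lookup_db.pag and lookup_db.dir on disk. Only one line of each file is held in memory at a time. Note that SDBM_File limits each key plus value to roughly 1 KB, so longer records would need one of the other modules.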

But there is really no reason to mess around with them, because we have SQLite, and it does everything they do faster and better.

Create a table in SQLite.

create table imported (
    id integer,
    value text
);

Use the sqlite shell's .import to import the file, using .mode and .separator to adjust the format.

sqlite>     create table imported (
   ...>         id integer,
   ...>         value text
   ...>     );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN     
12346       NITINfoo  
2398        bar       
9823        baz

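Now you, and anyone else who has to work with the data, can use efficient, flexible SQL to do whatever you like. Even if the import takes a while, you can go and do something else in the meantime.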
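
To connect this back to the original merge, here is a minimal sketch of joining two imported tables on the key from Perl. It assumes the DBI and DBD::SQLite modules from CPAN are installed. The table names app and rep, the helpers merged_rows and first_and_last, and the rule of keeping the first and last comma-separated fields (inferred from the sample output in the question) are illustrative assumptions. A small in-memory database stands in for the file that .import would populate.

```perl
use strict;
use warnings;
use DBI;   # assumes DBI and DBD::SQLite are installed from CPAN

# Keep the first and last comma-separated fields of a value
# (an assumption inferred from the sample output).
sub first_and_last {
    my @f = split /,/, $_[0];
    return join ',', $f[0], $f[-1];
}

# Hypothetical helper: join the two tables on id and return the merged rows.
sub merged_rows {
    my ($dbh) = @_;
    my $sth = $dbh->prepare(
        'select app.id, app.value, rep.value
           from app join rep on rep.id = app.id
          order by app.id'
    );
    $sth->execute;
    my @rows;
    while (my ($id, $app_val, $rep_val) = $sth->fetchrow_array) {
        push @rows, join ',', $id, first_and_last($app_val), first_and_last($rep_val);
    }
    return \@rows;
}

# Small in-memory demo using the sample records from the question.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('create table app (id integer, value text)');
$dbh->do('create table rep (id integer, value text)');
$dbh->do('insert into app values (?, ?)', undef, 1001234, 'X15X1211,J,S,12,15,100.05');
$dbh->do('insert into rep values (?, ?)', undef, @$_)
    for [1231112, 'AJ32,18,JP'], [1001234, 'AJ15,16,PP'];

print "$_\n" for @{ merged_rows($dbh) };   # prints 1001234,X15X1211,100.05,AJ15,PP
```

With hundreds of millions of rows, an index on id in at least one of the tables is likely worth creating before running the join.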

(Editor: 李大同)
