Trying to remove specific columns using splice in Perl
I am a complete Perl beginner, looking for help with my first Perl script.
I have some huge files, 30-50GB each, structured as follows (millions of columns and thousands of rows):

```
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
A B C D E 1 2 3 4 5 6 7 8 9 10
```

I want to delete column "A" and column "C", and then every third numeric column: the "3" column, the "6" column, the "9" column, and so on until the end of the file. The file is space-delimited.

My attempt looks like this:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

my @dataColumns;
my $dataColumnCount;

if(scalar(@ARGV) != 2){
    print "\nNo files supplied, please supply file name\n";
    exit;
}

my $Infile = $ARGV[0];
my $Outfile = $ARGV[1];

open(INFO,$Infile) || die "Could not open $Infile for reading";
open(OUT,">$Outfile") || die "Could not open $Outfile for writing";

while (<INFO>) {
    chop;
    @dataColumns = split(" ");
    $dataColumnCount = @dataColumns + 1;

    #Now remove the first element of the list
    shift(@dataColumns);

    #Now remove the third element (Note that it is now the second - after removal of the first)
    splice(@dataColumns,1,1); # remove the third element (now the second)

    #Now remove the 6th (originally the 8th) and every third one thereafter
    #NB There are now $dataColumnCount-1 columns
    for (my $i = 5; $i < $dataColumnCount-1; $i = $i + 3 ) {
        splice($dataColumns; $i; 1);
    }

    #Now join the remaining elements of the list back into a single string
    my $AmendedLine = join(" ",@dataColumns);

    #Finally print out the line into your new file
    print OUT "$AmendedLine/n";
}
```

But I get some strange errors.

It says it doesn't like my $i in the for loop. I added a 'my', which seems to make the error go away, and other people's code seems to include the 'my' as well, so I'm not sure what is going on. The message is: Global symbol "$i" requires explicit package name at Convertversion2.pl line 36.

Another error is this:

I don't know how to correct this error. I think I'm almost there, but I'm not sure exactly what the syntax error is or how to fix it.

Thanks in advance.

Solution
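For reference, here is a minimal sketch of how the splice-based approach from the question could be made to run. It is not the code from the answer that follows, and the helper name `drop_columns` is made up for illustration. Two things differ from the question's loop: `splice` takes the array itself (`@dataColumns`, not `$dataColumns`) with comma-separated arguments, and the removals run from right to left, because each `splice` shifts every later element one position to the left.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one line with
# column A, column C, and every third numeric column removed.
sub drop_columns {
    my ($line) = @_;
    my @cols = split ' ', $line;

    # 0-based indices to delete in the ORIGINAL row:
    # A (0), C (2), then the numeric columns "3", "6", "9", ... (7, 10, 13, ...)
    my @drop = (0, 2);
    for (my $i = 7; $i < @cols; $i += 3) {
        push @drop, $i;
    }

    # Delete from the right so that the indices still to be removed
    # are not shifted by earlier deletions.
    for my $i (reverse @drop) {
        splice(@cols, $i, 1) if $i < @cols;
    }

    return join ' ', @cols;
}

print drop_columns('A B C D E 1 2 3 4 5 6 7 8 9 10'), "\n";
# prints: B D E 1 2 4 5 7 8 10
```

With millions of columns per row, calling `splice` once per removed column is likely to be slow, since each call moves the tail of the array. Building the list of fields to keep, as the code below does, avoids that.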
After I blogged about this problem, a commenter pointed out that, for my test case, the run time could be reduced by 45%. I paraphrased his code slightly:
```perl
my @keep;
while (<>) {
    my @data = split;

    # Build the keep mask once, from the first line:
    # drop A, keep B, drop C, keep D and E ...
    unless (@keep) {
        @keep = (0, 1, 0, 1, 1);
        # ... then keep two numeric columns and drop the third.
        for (my $i = 5; $i < @data; $i += 3) {
            push @keep, 1, 1, 0;
        }
    }

    my $i = 0;
    print join(' ', grep $keep[$i++], @data), "\n";
}
```

This takes almost half the time of my original solution:

```
$ time ./zz.pl input.data > /dev/null

real    0m21.861s
user    0m21.310s
sys     0m0.280s
```

Now, it is possible to gain another 45% performance in a fairly dirty way, using Inline::C:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Inline C => <<'END_C'

/*
  This code 'works' only in a limited set of circumstances!
  Don't expect anything good if you feed it anything other
  than plain ASCII
*/

#include <ctype.h>

SV *
extract_fields(char *line, AV *wanted_fields)
{
    int ch;
    IV current_field = 0;
    IV wanted_field = -1;

    unsigned char *cursor = line;
    unsigned char *field_begin = line;
    unsigned char *save_field_begin;

    STRLEN field_len = 0;

    IV i_wanted = 0;
    IV n_wanted = av_len(wanted_fields);

    AV *ret = newAV();

    while (i_wanted <= n_wanted) {
        SV **p_wanted = av_fetch(wanted_fields, i_wanted, 0);
        /* av_fetch returns NULL when the element does not exist */
        if (!p_wanted || !(*p_wanted)) {
            croak("av_fetch returned NULL pointer");
        }
        wanted_field = SvIV(*p_wanted);

        while ((ch = *(cursor++))) {

            if (!isspace(ch)) {
                continue;
            }

            field_len = cursor - field_begin - 1;
            save_field_begin = field_begin;
            field_begin = cursor;

            current_field += 1;

            if (current_field != wanted_field) {
                continue;
            }

            av_push(ret, newSVpvn(save_field_begin, field_len));
            break;
        }

        i_wanted += 1;
    }

    return newRV_noinc((SV *) ret);
}

END_C
;
```

And here is the Perl part. Note that we split only once, to determine the indices of the fields to keep. Once we know those, we pass the line and the (1-based) indices to the C routine to slice and dice.

```perl
my @keep;
while (my $line = <>) {
    unless (@keep) {
        @keep = (2, 4, 5);
        my @data = split ' ', $line;
        push @keep, grep +(($_ - 5) % 3), 6 .. scalar(@data);
    }
    # The C routine expects an array reference.
    my $fields = extract_fields($line, \@keep);
    print join(' ', @$fields), "\n";
}
```

```
$ time ./ww.pl input.data > /dev/null

real    0m11.539s
user    0m11.083s
sys     0m0.300s
```

input.data was generated using:

```
$ perl -E 'say join(" ", "A" .. "ZZZZ") for 1 .. 100' > input.data
```

It is about 225MB in size.
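The "work out which fields to keep once, then reuse that for every line" idea can also be written with a plain array slice. The sketch below is an illustration only, not code from the post and not benchmarked. The helper names `keep_indices` and `filter_line` are hypothetical, and it assumes every row has the same number of columns as the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based indices of the fields to keep in a row of
# $n fields. It starts from the 1-based positions used in the Inline::C
# version (B, D, E, then two out of every three numeric columns).
sub keep_indices {
    my ($n) = @_;
    my @one_based = (2, 4, 5, grep { ($_ - 5) % 3 } 6 .. $n);
    # Convert to 0-based and ignore positions beyond a short row.
    return map { $_ - 1 } grep { $_ <= $n } @one_based;
}

# Hypothetical helper: apply a precomputed index list to one line.
sub filter_line {
    my ($line, $idx) = @_;
    my @data = split ' ', $line;
    return join ' ', @data[@$idx];
}

my $row    = 'A B C D E 1 2 3 4 5 6 7 8 9 10';
my @fields = split ' ', $row;
my @idx    = keep_indices(scalar @fields);

print filter_line($row, \@idx), "\n";
# prints: B D E 1 2 4 5 7 8 10
```

In a real run, `keep_indices` would be called once on the first line and `filter_line` inside the `while (<>)` loop.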