How can I extract/parse tabular data from a text file in Perl?
Published: 2020-12-15 23:36:21 | Category: Big Data | Source: compiled from the web
I am looking for something like HTML::TableExtract, not for HTML input, but for plain-text input that contains "tables" formatted with indentation and spacing.
The data might look like this:

    Here is some header text.

    Column One       Column Two      Column Three
    a                                b
    a                b               c

    Some more text

    Another Table     Another Column
    abdbdbdb          aaaa

Solution
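I don't know of any packaged solution, but something quite flexible is fairly simple to write, assuming you can make two passes over the file (what follows is a partly Perlish pseudocode example):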
Assumption: the data may contain spaces and is NOT quoted à la CSV when it does. If that is not the case, just use Text::CSV(_XS).

Example code:

    use strict;
    use warnings;

    # Assumption: the input file name is the first command-line argument
    # (the original left the arguments of open() elided).
    my $file = shift @ARGV or die "Usage: $0 FILE\n";

    my $INFER_FROM_N_LINES = 10;  # Infer columns from this many lines;
                                  # 0 means from the entire file
    my $lines_scanned = 0;
    my @non_spaces;               # True at index $i if any scanned line
                                  # has a non-space character at position $i

    # First pass: find which character positions are blank on every
    # line and which are not
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
        chomp $line;
        my @chars = split //, $line;
        for my $i (0 .. $#chars) {
            $non_spaces[$i] = 1 if $chars[$i] ne " ";
        }
    }
    close $fh or die "Cannot close $file: $!";

    # Find columns, defined as consecutive runs of "non-space" positions
    my (@starts, @ends);   # Indexes at which columns start and end
    my $state = " ";       # " " = not inside a column, "c" = inside one
    for my $i (0 .. $#non_spaces) {
        next if $state eq " " && !$non_spaces[$i];
        next if $state eq "c" &&  $non_spaces[$i];
        if ($state eq " ") {   # and $non_spaces[$i] is set => column starts
            $state = "c";
            push @starts, $i;
        }
        else {                 # $state eq "c" and position is blank => column ends
            $state = " ";
            push @ends, $i - 1;
        }
    }
    # The last recorded position is never blank, so close the final column
    push @ends, $#non_spaces if $state eq "c";

    # Second pass: split each line at the inferred positions
    my @rows;
    open($fh, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @columns;
        for my $col_num (0 .. $#starts) {
            my $width = $ends[$col_num] - $starts[$col_num] + 1;
            # A line shorter than the column start yields an empty cell
            # instead of a "substr outside of string" warning
            $columns[$col_num] = $starts[$col_num] < length($line)
                ? substr($line, $starts[$col_num], $width)
                : "";
        }
        push @rows, \@columns;   # Push a reference so each row stays a separate array
    }
    close $fh or die "Cannot close $file: $!";

Now, if you require columns to be separated by at least X spaces, where X > 1, that is also doable, but the parser for column positions needs to be a bit more complex. The following replaces the "find columns" step above:

    # Find columns, defined as consecutive runs of "non-space" positions
    # separated by at least 3 spaces
    my $min_col_separator_is_X_spaces = 3;
    my (@starts, @ends);   # Indexes at which columns start and end
    my $state = "S";       # "S" = inside a separator, "c" = inside a column
    NEXT_CHAR:
    for (my $i = 0; $i < @non_spaces; $i++) {
        if ($state eq "S") {           # Done with the last column, inside a separator
            if ($non_spaces[$i]) {     # Start a new column
                $state = "c";
                push @starts, $i;
            }
            next;
        }
        # $state eq "c": processing a column
        next if $non_spaces[$i];
        # First space after a non-space. Could this be the beginning of a
        # separator? Check that the following X-1 positions are blank too.
        for (my $j = $i + 1;
             $j < @non_spaces && $j < $i + $min_col_separator_is_X_spaces;
             $j++) {
            if ($non_spaces[$j]) {
                $i = $j;           # Gap too short, still inside the column;
                next NEXT_CHAR;    # no need to re-scan the gap (outer loop)
            }
        }
        # If we reach here, X positions in a row are blank: the column ended
        push @ends, $i - 1;
        $state = "S";
        $i += $min_col_separator_is_X_spaces - 1;   # Skip the rest of the separator
    }
    # The last recorded position is never blank, so close the final column
    push @ends, $#non_spaces if $state eq "c";

(Editor: 李大同)
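To make the two-pass idea easier to try out, here is a minimal self-contained sketch that works on an in-memory list of lines rather than a file. The helper names `infer_columns` and `split_row` are hypothetical and not part of any module, and the sample rows are the second table from the question. It uses a gap counter instead of the state machine above, so treat it as one possible variation rather than the answer's exact method.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [start, end] character ranges.
# A run of at least $min_sep positions that are blank on every line
# separates two columns.
sub infer_columns {
    my ($lines, $min_sep) = @_;

    # Pass 1: mark every position that holds a non-space on some line
    my @non_spaces;
    for my $line (@$lines) {
        my @chars = split //, $line;
        for my $i (0 .. $#chars) {
            $non_spaces[$i] = 1 if $chars[$i] ne ' ';
        }
    }

    # Turn the marks into column ranges
    my @columns;
    my ($start, $gap) = (undef, 0);
    for my $i (0 .. $#non_spaces) {
        if ($non_spaces[$i]) {
            $start = $i unless defined $start;   # a column begins here
            $gap   = 0;                          # any short gap is absorbed
        }
        elsif (defined $start && ++$gap == $min_sep) {
            push @columns, [$start, $i - $min_sep];   # gap is wide enough
            undef $start;
        }
    }
    push @columns, [$start, $#non_spaces] if defined $start;
    return @columns;
}

# Hypothetical helper: cut one line into cells at the given ranges,
# trimming trailing whitespace from each cell.
sub split_row {
    my ($line, @columns) = @_;
    return map {
        my ($s, $e) = @$_;
        my $cell = $s < length($line) ? substr($line, $s, $e - $s + 1) : '';
        $cell =~ s/\s+$//;
        $cell;
    } @columns;
}

my @table = (
    'Another Table     Another Column',
    'abdbdbdb          aaaa',
);
my @columns = infer_columns(\@table, 2);
my @rows    = map { [ split_row($_, @columns) ] } @table;
print join(' | ', @$_), "\n" for @rows;
```

With a minimum separator of 2, the headers "Another Table" and "Another Column" each stay in one piece. With a minimum of 1, the single space inside "Another Column" would count as a separator and the sketch would report three columns, which is why the answer's second variant requires several spaces.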