How do I extract/parse tabular data from a text file in Perl?

Published: 2020-12-15 23:36:21 | Category: Big Data | Source: compiled from the web

I'm looking for something like HTML::TableExtract, only not for HTML input but for plain-text input that contains "tables" formatted with indentation and spacing.

The data could look like this:

Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c


Some more text

Another Table     Another Column
abdbdbdb          aaaa

Solution

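I'm not aware of any packaged solution, but something very flexible is fairly simple to build, assuming you can make two passes over the file (what follows is a Perl sketch):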

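>Assumption: the data may contain spaces and is NOT quoted à la CSV when it does. If that is not the case, just use Text::CSV or Text::CSV_XS.
>Assumption: no tabs are used for formatting.
>The logic defines a "column separator" as any run of consecutive character positions that hold a space on every line.
>If by chance every line has a space at offset M that is really part of the data, the logic will treat offset M as a column separator, because it has no way of knowing better. The only way it can know better is if you require columns to be separated by at least X spaces, where X > 1; see the second code fragment.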
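
To make the separator rule concrete, here is a small sketch that lists the offsets that are blank on every line. The helper name blank_offsets and both sample tables are made up for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character offsets that are blank on every
# given line. A line that is shorter than the widest line counts as blank
# beyond its end.
sub blank_offsets {
    my @lines = @_;
    my @non_space;    # $non_space[$i] is true if any line has data at offset $i
    for my $line (@lines) {
        for my $i (0 .. length($line) - 1) {
            $non_space[$i] = 1 if substr($line, $i, 1) ne ' ';
        }
    }
    return grep { !$non_space[$_] } 0 .. $#non_space;
}

# A clean table: only the real gap is blank on every line.
print join(',', blank_offsets('Name   Qty', 'apple  3', 'fig    12')), "\n";    # 5,6

# The pitfall: every data value happens to have a space at offset 3.
print join(',', blank_offsets('New York  10', 'Las Vegas  7')), "\n";           # 3,9
```

In the second call, offset 3 is reported as a separator even though it sits inside "New York" and "Las Vegas", which is exactly the ambiguity the last point describes.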

Sample code:

use strict;
use warnings;

# Input path; taking it from the command line is an assumption, since the
# original elides the arguments to open().
my $file = shift @ARGV or die "usage: $0 FILE\n";

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces = (); # $non_spaces[$i] is true if any scanned line has a non-space at offset $i

# First pass - find which character columns in the file have all spaces and which don't
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp $line;
    my @chars = split(//, $line);
    for (my $i = 0; $i < @chars; $i++) {
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die "Cannot close $file: $!";

# Find columns, defined as consecutive "non-space" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i - 1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Second pass - now split the lines
open($fh, '<', $file) or die "Cannot open $file: $!";
my @rows = ();
while (my $line = <$fh>) {
    chomp $line;
    my @columns = ();
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        my $width = $ends[$col_num] - $starts[$col_num] + 1;
        # A line that ends before this column starts yields an empty field
        $columns[$col_num] = $starts[$col_num] < length($line)
            ? substr($line, $starts[$col_num], $width)
            : "";
    }
    push @rows, \@columns; # push a reference so each row keeps its own fields
}
close $fh or die "Cannot close $file: $!";
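
Once the column boundaries are known, the per-line substr loop can also be written as a single unpack template. The following is only a sketch under a few assumptions: the input is plain single-byte text, and the helper names build_template and split_fixed as well as the sample boundaries are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn column boundaries into an unpack template such as
# '@0 A5 @7 A3'. '@N' jumps to absolute offset N, and 'A5' reads 5 characters
# and strips trailing spaces.
sub build_template {
    my ($starts, $ends) = @_;
    my @parts;
    for my $i (0 .. $#$starts) {
        my $width = $ends->[$i] - $starts->[$i] + 1;
        push @parts, sprintf('@%d A%d', $starts->[$i], $width);
    }
    return join ' ', @parts;
}

# Hypothetical helper: split one line with the template. The line is padded to
# $width first, because unpack dies when '@N' points past the end of the string.
sub split_fixed {
    my ($template, $width, $line) = @_;
    return unpack $template, sprintf('%-*s', $width, $line);
}

my @starts   = (0, 7);    # sample boundaries, as the column finder would produce
my @ends     = (4, 9);
my $template = build_template(\@starts, \@ends);
my $width    = $ends[-1] + 1;

for my $line ('Name   Qty', 'apple  3', 'fig') {
    print join('|', split_fixed($template, $width, $line)), "\n";
}
```

Unlike the substr loop, the 'A' format drops trailing padding from each field, which may or may not be what you want.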

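Now, if you require columns to be separated by at least X spaces where X > 1, that is also doable, but the parser for column positions needs to be a bit more complex: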

# Find columns, defined as consecutive "non-space" slices separated by at least 3 spaces.
# Replaces the column-finding step above; expects @non_spaces from the first pass.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    # $state eq "c": processing a column
    next if $non_spaces[$i];
    # First space after a non-space. Beginning of a separator? Check the next X-1 chars!
    my $gap_end = $i + $min_col_separator_is_X_spaces - 1;
    for (my $j = $i + 1; $j <= $gap_end && $j < @non_spaces; $j++) {
        if ($non_spaces[$j]) { # Gap too narrow - still inside the column
            $i = $j; # No need to re-scan the gap
            next NEXT_CHAR; # OUTER loop
        }
    }
    # If we reach here, X chars in a row are spaces! Column ended!
    push @ends, $i - 1;
    $state = "S";
    $i = $gap_end; # Skip the rest of the verified gap
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}
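
As a cross-check on the state machine, the same minimum-gap rule can be sketched with a regex over a mask string. This is a different technique from the loop above, not the original author's code. The helper name columns_with_min_gap and the sample flags are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: given a minimum separator width and the per-offset
# flags (true = some line has data at that offset), return [start, end]
# pairs, one per column.
sub columns_with_min_gap {
    my ($min_gap, @non_spaces) = @_;
    # Render the flags as a string: 'x' for data, ' ' for an all-blank offset.
    my $mask   = join '', map { $_ ? 'x' : ' ' } @non_spaces;
    my $max_in = $min_gap - 1;    # widest blank run that still belongs to a column
    my @columns;
    # A column starts and ends on 'x' and only contains blank runs narrower than $min_gap.
    while ($mask =~ /x(?: {0,$max_in}x)*/g) {
        push @columns, [ $-[0], $+[0] - 1 ];
    }
    return @columns;
}

# Sample flags: data at 0-2 and 4-8 (a one-space hole at offset 3),
# a three-space gap at 9-11, then data at 12-13.
my @flags = (1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1);

for my $min_gap (1, 3) {
    my @cols = columns_with_min_gap($min_gap, @flags);
    print "min gap $min_gap: ", join(' ', map { join '-', @$_ } @cols), "\n";
}
# min gap 1: 0-2 4-8 12-13
# min gap 3: 0-8 12-13
```

With a minimum gap of 3, the single blank at offset 3 no longer splits the first column, while the three-space gap still does.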

(Editor: 李大同)
