"Out of memory" when parsing a large (100 MB) XML file with Perl

Published: 2020-12-15 21:39:33 | Category: Big Data | Source: compiled from the web
I get an "Out of memory" error when parsing a large (100 MB) XML file:
use strict;
use warnings;
use XML::Twig;

my $twig=XML::Twig->new();
my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code,"\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh,">>","Result.csv" or die $!;
    print $fh join("\t",$house_code,$amount_of_bedrooms),"\n";
    close $fh;
}

What can I do to fix this error?

Solution

Handling large XML files that do not fit in memory is exactly what XML::Twig advertises:

One of the strengths of XML::Twig is that it let you work with files
that do not fit in memory (BTW storing an XML document in memory as a
tree is quite memory-expensive, the expansion factor being often
around 10).

To do this you can define handlers, that will be called once a
specific element has been completely parsed. In these handlers you can
access the element and process it as you see fit (…)

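The code posted in the question does not take advantage of XML::Twig's strengths at all: calling simplify on the fully parsed document builds the whole tree in memory first, so it is no better than using XML::Simple.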

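What is missing from the code is 'twig_handlers' or 'twig_roots', which make the parser work through only the relevant portions of the XML document in a memory-efficient way.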

Without seeing the XML, it is hard to say whether processing the document chunk-by-chunk or processing just selected parts is the better fit, but either one should solve the problem.

So the code should look something like the following (chunk-by-chunk demonstration):

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,}
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig,$element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

    for my $divParent ( @divParents ) {
        my @divisions = $divParent->children( 'Division' );
        my $total = sum map { $_->text } @divisions;
        $bedrooms{$id} = $total;
    }

    $element->purge;   # Free up memory
}

dump %bedrooms;
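
The demonstration above sums the text of every Division element. The code in the question does something slightly different: it counts only certain division numbers and adds up DivisionQuantity. Below is a minimal sketch of that filtering moved into a handler. It assumes that HouseCode, DivisionNumber and DivisionQuantity are child elements (not attributes), which is only inferred from the hash keys used in the question. The inline sample document and its DivisionHouseRooms root element are made up for illustration, and the rows go to STDOUT here rather than to Result.csv.

```perl
use strict;
use warnings;
use XML::Twig;

# Division numbers counted as bedrooms (list taken from the question)
my %is_bedroom = map { $_ => 1 } qw( 30 31 32 35 38 );

my @rows;    # one tab-separated line per property

sub count_bedrooms {
    my ( $twig, $property ) = @_;

    my $house_code = $property->first_child_text('HouseCode');
    my $bedrooms   = 0;

    for my $divisions ( $property->children('Divisions') ) {
        for my $division ( $divisions->children('Division') ) {
            # Assumption: DivisionNumber and DivisionQuantity are child elements
            next unless $is_bedroom{ $division->first_child_text('DivisionNumber') };
            $bedrooms += $division->first_child_text('DivisionQuantity') || 0;
        }
    }

    push @rows, join( "\t", $house_code, $bedrooms );

    $twig->purge;    # discard everything parsed so far to free memory
}

# Hypothetical sample input; the real file would be read with parsefile()
my $sample = <<'XML';
<DivisionHouseRooms>
  <DivisionHouseRoom>
    <HouseCode>H1</HouseCode>
    <Divisions>
      <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
      <Division><DivisionNumber>31</DivisionNumber><DivisionQuantity>1</DivisionQuantity></Division>
      <Division><DivisionNumber>99</DivisionNumber><DivisionQuantity>5</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
  <DivisionHouseRoom>
    <HouseCode>H2</HouseCode>
    <Divisions>
      <Division><DivisionNumber>12</DivisionNumber><DivisionQuantity>4</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
</DivisionHouseRooms>
XML

XML::Twig->new( twig_roots => { DivisionHouseRoom => \&count_bedrooms } )
         ->parse($sample);

print "$_\n" for @rows;
```

For the real 100 MB file, parse($sample) would become parsefile('divisionhouserooms-v3.xml'). Instead of collecting @rows, the handler could print each line straight to a filehandle opened once before parsing, so memory use does not grow with the number of properties.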

(Editor: 李大同)
