Parsing – How to parse a very large file in F# using FParsec

Published: 2020-12-14 04:44:04 | Category: Big Data | Source: compiled from the web


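I'm trying to parse a very large file using FParsec. The file is 61 GB, which is too big to fit in RAM, so I'd like to produce a sequence of results (i.e. seq<'Result>) rather than a list, if possible. Can this be done with FParsec? (I've come up with a jury-rigged implementation that actually does this, but it doesn't work well in practice because CharStream.Seek has O(n) performance.)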
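To make the goal concrete, here is a minimal sketch of the lazy seq<'Result> shape described above. It assumes each record fits on one line and does not use FParsec at all: the `Record` type, the comma-separated format, and `parseLine` are hypothetical stand-ins. In a real setup, `parseLine` could instead call FParsec's `run` with a per-record parser on each line, which would avoid CharStream.Seek on the big stream entirely.

```fsharp
open System.IO

// Hypothetical record type and line format ("id,name"), for illustration only.
type Record = { Id: int; Name: string }

// Hypothetical per-line parser. With FParsec this could wrap
// `run recordParser line` and map its result to Ok / Error.
let parseLine (line: string) : Result<Record, string> =
    match line.Split ',' with
    | [| idText; name |] ->
        let ok, n = System.Int32.TryParse(idText)
        if ok then Ok { Id = n; Name = name }
        else Error (sprintf "bad id in line: %s" line)
    | _ -> Error (sprintf "expected 2 fields in line: %s" line)

// File.ReadLines reads lazily, so lines are pulled from disk only as the
// resulting sequence is consumed; the file is never loaded as a whole.
let parseFile (path: string) : seq<Result<Record, string>> =
    File.ReadLines path |> Seq.map parseLine
```

Because the result is a `seq`, a consumer such as `parseFile path |> Seq.iter handle` processes one record at a time; calling `Seq.toList` on it would defeat the purpose for a 61 GB input.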

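The file is line-oriented (one record per line), which should in theory make it possible to parse it in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says: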

If you’re dealing with large input files or very slow parsers, it
might also be worth trying to parse multiple sections within a single
file in parallel. For this to be efficient there must be a fast way to
find the start and end points of such sections. For example, if you
are parsing a large serialized data structure, the format might allow
you to easily skip over segments within the file, so that you can chop
up the input into multiple independent parts that can be parsed in
parallel. Another example could be a programming languages whose
grammar makes it easy to skip over a complete class or function
definition, e.g. by finding the closing brace or by interpreting the
indentation. In this case it might be worth not to parse the
definitions directly when they are encountered, but instead to skip
over them, push their text content into a queue and then to process
that queue in parallel.

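This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?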
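For reference, the queue idea can be sketched with plain .NET types, independent of FParsec. The following is only a rough sketch under stated assumptions: `parseInBatches` and all of its parameters are hypothetical names, records are assumed never to span lines, and there is no error handling (if every worker fails, the reader would block on a full queue). The bounded `BlockingCollection` is what keeps memory in check, since the reader stalls whenever `maxQueued` batches are already waiting.

```fsharp
open System
open System.Collections.Concurrent
open System.IO
open System.Threading.Tasks

// Hypothetical helper, not part of FParsec: streams the file in batches of
// `batchSize` lines through a bounded queue to `workers` consumer tasks.
// `handleBatch` is where the real parsing of one batch would happen.
let parseInBatches
    (path: string)
    (batchSize: int)
    (maxQueued: int)
    (workers: int)
    (handleBatch: string[] -> unit)
    : unit =
    // Bounded capacity: Add blocks while maxQueued batches are waiting,
    // so the reader cannot run arbitrarily far ahead of the workers.
    use queue = new BlockingCollection<string[]>(maxQueued)

    let produce () =
        try
            // File.ReadLines and Seq.chunkBySize are both lazy.
            for batch in File.ReadLines path |> Seq.chunkBySize batchSize do
                queue.Add batch
        finally
            // Lets the consumers' enumeration end once the queue drains.
            queue.CompleteAdding()

    let consume () =
        for batch in queue.GetConsumingEnumerable() do
            handleBatch batch

    let producer = Task.Run(Action(produce))
    let consumers = [| for _ in 1 .. workers -> Task.Run(Action(consume)) |]
    producer.Wait()
    for c in consumers do c.Wait()
```

A call such as `parseInBatches path 1000 8 4 handle` would keep roughly a dozen batches in memory at any time (queued ones, the one being filled, and one per worker). Inside `handle`, each line could be passed to FParsec's `run`; results should be written out or aggregated there rather than accumulated, and their order across batches is not preserved.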

FWIW, the file I'm trying to parse is here, if anyone wants to give it a try.

(Editor: 李大同)
