When does using handler functions improve HTML parsing efficiency?

Published: 2020-12-14 23:26:09 Category: Resources Source: compiled from the web


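This question uses R. It is also tagged [xml] and [html] in case users of those tags have any input.

With the XML package, I have always assumed that parsing an HTML document with handler functions would improve overall efficiency, since the document is created at the C level. However, I have been working with handlers for a while now and have yet to find a case where that assumption actually holds.

Perhaps I am not thinking about this in the right context (maybe a handler is more useful on a larger, more deeply nested document?). Anyway, here is my reasoning.

Consider the following two examples.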

library(XML)
library(microbenchmark)
u <- "http://www.baseball-reference.com"

Example 1: Get the attributes of all nodes named "input" (search form names)

withHandler1 <- function() {
    h <- function() {
        input <- character()
        list(input = function(node,...) {
            input <<- c(input,list(xmlAttrs(node,...)))
            node
        },value = function() input)
    }
    h1 <- h()
    htmlParse(u,handler = h1)
    h1$value()
}

withoutHandler1 <- function() {
    xmlApply(htmlParse(u)["//input"],xmlAttrs)
}

identical(withHandler1(),withoutHandler1())
# [1] TRUE

microbenchmark(withHandler1(),withoutHandler1(),times = 25L)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq     max neval cld
#    withHandler1() 944.6507 1001.419 1051.602 1020.347 1097.073 1315.23    25   a
# withoutHandler1() 964.6079 1006.799 1040.905 1039.993 1069.029 1126.49    25   a

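OK, this is a very basic example, but the timings are nearly identical, and I suspect that if I ran it the default 100 times they would converge.

Example 2: Get a subset of the attributes of all nodes named "input"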

withHandler2  <- function() {    
    searchBoxHandler <- function(attr = character()) {
        input <- character()
        list(input = function(node,...) {
            input <<- c(input,list(
                if(identical(attr,character())) xmlAttrs(node,...)
                else vapply(attr[attr %in% names(xmlAttrs(node))],xmlGetAttr,"",node = node)
            ))
            node
        },value = function() input)
    }
    h1 <- searchBoxHandler(attr = c("id","type"))
    htmlParse(u,handler = h1)
    h1$value()
}    

withoutHandler2 <- function() {
    xmlApply(htmlParse(u)["//input"],function(x) {
        ## Note: match() used only to return identical objects
        xmlAttrs(x)[na.omit(match(c("id","type"),names(xmlAttrs(x))))]
    })
}

identical(withHandler2(),withoutHandler2())
# [1] TRUE

microbenchmark(withHandler2(),withoutHandler2(),times = 25L)
# Unit: milliseconds
#              expr      min        lq     mean   median       uq      max neval cld
#    withHandler2() 966.0951 1010.3940 1129.360 1038.206 1119.642 2075.070    25   a
# withoutHandler2() 962.8655  999.4754 1166.231 1046.204 1118.661 2385.782    25   a

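Again, very basic. But also nearly the same.

So my question is: why use handler functions at all? For these examples, writing the handlers was wasted effort. Are there specific, potentially expensive operations for which handler functions would give a noticeable gain in speed and efficiency when parsing HTML?

Solution

See the "Programming interfaces" section of the Wikipedia article on XML: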

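Existing APIs for XML processing tend to fall into these categories:
- Stream-oriented APIs accessible from a programming language, for
  example SAX and StAX.
- Tree-traversal APIs accessible from a programming language, for
  example DOM.
- XML data binding, which provides an automated translation between an
  XML document and programming-language objects.
- Declarative transformation languages such as XSLT and XQuery.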

Stream-oriented facilities require less memory and, for certain tasks
which are based on a linear traversal of an XML document, are faster
and simpler than other alternatives. Tree-traversal and data-binding
APIs typically require the use of much more memory, but are often
found more convenient for use by programmers; some include declarative
retrieval of document components via the use of XPath expressions.
XSLT is designed for declarative description of XML document
transformations, and has been widely implemented both in server-side
packages and Web browsers. XQuery overlaps XSLT in its functionality,
but is designed more for searching of large XML databases.

Clearly, performance is not the only factor to consider. For example:

SAX is fast and efficient to implement, but difficult to use for
extracting information at random from the XML, since it tends to
burden the application author with keeping track of what part of the
document is being processed. It is better suited to situations in
which certain types of information are always handled the same way, no
matter where they occur in the document.

On the other hand:

The Document Object Model (DOM) is an interface-oriented application
programming interface that allows for navigation of the entire
document as if it were a tree of node objects representing the
document’s contents. A DOM document can be created by a parser, or can
be generated manually by users (with limitations). Data types in DOM
nodes are abstract; implementations provide their own programming
language-specific bindings. DOM implementations tend to be memory
intensive, as they generally require the entire document to be loaded
into memory and constructed as a tree of objects before access is
allowed.
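
To make the contrast between the two quoted API styles concrete, here is a minimal sketch in R using the XML package. It is not taken from the question or the answer: the sample document `txt` and the helper names `sax_input_ids` and `dom_input_ids` are hypothetical, and the input is a tiny well-formed XML string rather than a real HTML page. The first function uses `xmlEventParse`, which calls a handler as each opening tag is encountered and builds no tree; the second parses the whole document into a tree and then queries it with XPath.

```r
library(XML)

# Hypothetical sample data: a tiny well-formed document standing in for a large one.
txt <- '<form><input id="a" type="text"/><p>hello</p><input id="b" type="submit"/></form>'

# Stream-oriented (SAX-style): startElement is called for every opening tag,
# and no document tree is kept in memory.
sax_input_ids <- function(txt) {
    ids <- character()
    xmlEventParse(txt, asText = TRUE, handlers = list(
        startElement = function(name, attrs, ...) {
            # attrs is a named character vector (NULL when the tag has no attributes)
            if (name == "input" && "id" %in% names(attrs)) {
                ids <<- c(ids, attrs[["id"]])
            }
        }
    ))
    ids
}

# Tree-based (DOM-style): build the full tree first, then query it with XPath.
dom_input_ids <- function(txt) {
    doc <- xmlParse(txt, asText = TRUE)
    xpathSApply(doc, "//input", xmlGetAttr, "id")
}

sax_input_ids(txt)
dom_input_ids(txt)
```

On a document this small, any difference between the two is unlikely to be measurable. The streaming version is mainly of interest when the document is too large to hold comfortably in memory as a tree.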

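To sum up:

Your examples are not cases where the data is large, and it is only in such cases that the circumstances dictate which interface is best to use.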

(Editor: 李大同)
