ruby – Extracting text between HTML tags with Nokogiri

Published: 2020-12-17 03:59:31  Category: Encyclopedia  Source: compiled from the web

I have HTML like this:

<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>

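I have a basic Nokogiri CSS node search that returns the <p> contents, but I can't find an example of how to target all the text between the Nth closing H2 and the next opening H2. I'm creating a CSV containing the output, so I'd also like to read in a list of files and put the URL as the first result.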

Solution

require 'rubygems'
require 'nokogiri'

h = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(h)

# Specify which sections to extract. Each number is the count of delimiter
# tags seen so far (1 = after the <h1>, 2 = after the first <h2>, and so on).
# The triple dot excludes the end point: 2...3 covers 2 but not 3.
EXTRACT_RANGES = [
  2...3, 4...5
]

# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1", "h2"
]

extracted_text = []

i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|

  if (DELIMITER_TAGS.include? el.name)
    i += 1
  else
    # Extract only if the current delimiter count falls in one of the ranges
    extract = EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }

    if extract
      s = el.inner_text.strip
      unless s.empty?
        extracted_text << s
      end
    end
  end

end

# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")

(Editor: Li Datong)
