ruby – Extracting text between HTML tags with Nokogiri
Published: 2020-12-17 03:59:31 | Category: Encyclopedia | Source: compiled from the web
I have HTML like this:

    <h1> Header is here</h1>
    <h2>Header 2 is here</h2>
    <p> Extract me!</p>
    <p> Extract me too!</p>
    <h2> Next Header 2</h2>
    <p>not interested</p>
    <p>not interested</p>
    <h2>Header 2 is here</h2>
    <p> Extract me!</p>
    <p> Extract me too!</p>

I have a basic Nokogiri CSS node search that returns the <p> content, but I can't find an example of how to target all the text between the Nth closing H2 and the next opening H2. I'm building a CSV from the output, so I would also like to read in a list of files and put the URL as the first field of each result.

Solution

    require 'rubygems'
    require 'nokogiri'

    h = '<h1> Header is here</h1>
    <h2>Header 2 is here</h2>
    <p> Extract me!</p>
    <p> Extract me too!</p>
    <h2> Next Header 2</h2>
    <p>not interested</p>
    <p>not interested</p>
    <h2>Header 2 is here</h2>
    <p> Extract me!</p>
    <p> Extract me too!</p>
    '

    doc = Nokogiri::HTML(h)

    # Specify the ranges between delimiter tags that you want to extract.
    # i (below) counts the delimiter tags seen so far, so 2...3 selects the
    # content after the 2nd delimiter and before the 3rd.
    # The triple dot excludes the end point: 1...2 means 1 and not 2.
    EXTRACT_RANGES = [2...3, 4...5]

    # Tags which count as delimiters, not to be extracted
    DELIMITER_TAGS = ["h1", "h2"]

    extracted_text = []
    i = 0

    # Change /"html"/"body" to the correct path of the tag which contains this list
    (doc/"html"/"body").children.each do |el|
      if DELIMITER_TAGS.include?(el.name)
        # A delimiter starts a new section; count it but do not extract it
        i += 1
      elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }
        # Skip whitespace-only text nodes between the tags
        s = el.inner_text.strip
        extracted_text << s unless s.empty?
      end
    end

    # Print out extracted text (each element's inner text is separated by newlines)
    puts extracted_text.join("\n")

(Editor: 李大同)
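The question also asks for CSV output with the URL as the first field, which the solution above does not cover. The following is only a minimal sketch of that step, assuming the extraction loop has already produced an array of strings for each page. The method name extracted_to_csv, the shape of the results hash, and the example.com URL are all hypothetical and do not come from the original post.

```ruby
require 'csv'

# Hypothetical helper (not from the original answer): turns
# { source => [extracted strings] } into CSV text, one row per source,
# with the source (URL or file name) in the first column.
def extracted_to_csv(results)
  CSV.generate do |csv|
    results.each do |source, texts|
      csv << [source, *texts]
    end
  end
end

# Sample data standing in for the output of the extraction loop;
# the URL is a placeholder.
results = {
  'http://example.com/page1.html' => ['Extract me!', 'Extract me too!',
                                      'Extract me!', 'Extract me too!']
}

puts extracted_to_csv(results)
```

To process a list of files, one option is to wrap the extraction loop in a method, call it once per file, and store each returned array in the hash under that file's URL before generating the CSV. CSV.generate takes care of quoting any field that contains a comma or a newline.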