python – Scrapy XPath between two keywords
Published: 2020-12-20 13:41:46 | Category: Python | Source: compiled from the web
I am trying to extract the text that sits between two keywords, like this:
    item['duties'] = titles.select('.//span/text()[following-sibling::*[text()="Qualifications/Duties" and preceding-sibling::*text()="Entity Information"]').extract()

The spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import request
    from scrapy.selector import HtmlXPathSelector
    from health.items import HealthItem

    class healthspider(CrawlSpider):
        name = "health"
        allowed_domains = ['mysite']
        start_urls = ['myurl']
        rules = (
            Rule(SgmlLinkExtractor(allow=("search/",)), callback="parse_health", follow=True),
            Rule(SgmlLinkExtractor(allow=("url",)), callback="parse_job"),
        )

        def parse_job(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select('//*[@itemprop="description"]')
            items = []
            for titles in titles:
                item = HealthItem()
                item['duties'] = titles.select('.//span[following-sibling::*[text()="Qualifications/Duties" and preceding-sibling::*text()="Entity Information"]/text()').extract()
                item['entity_info'] = titles.select('.//span/text()[45]').extract()
                items.append(item)
            print items
            return items

But I get an error saying:

    raise ValueError("Invalid XPath: %s" % query)
    exceptions.ValueError: Invalid XPath: .//span/text()[following-sibling::*[text()="Qualifications/Duties" and preceding-sibling::*text()="Entity Information"]

Is there a way to define an XPath like this in my spider?

Solution
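The expression in the question is rejected because "preceding-sibling::*text()" is not valid XPath syntax and the predicate brackets are unbalanced. A few options that work: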
Use the node() test to select both text nodes and element nodes:

    In [1]: sel.xpath(""".//node()[preceding-sibling::*="Qualifications/Duties"]
       ...:                        [following-sibling::*="Entity Information"]""").extract()
    Out[1]:
    [u'<br>',
     u'Texas Health Presbyterian Allen is currently in search of a Registered Nurse to help meet the growing needs of our Day Surgery Department to work PRN in Day Surgery and also float to PACU.',
     u'<br>',
     u'<b>Basic Qualifications:</b>',
     u'*Graduate of an accredited school of nursing',
     u'*Valid RN license in the state of Texas',
     u'*BLS',
     u'*ACLS',
     u'*PALS within 6 months of hire',
     u'*Minimum of 1 - 3 years experience as RN in Day Surgery,PACU,Outpatient Surgery,or ICU',
     u'*Strong organizational skills and ability to function in a fast paced work environment',
     u'*Ability to accept responsibility and show initiative to work without direct supervision',
     u'*A high degree of confidentiality,positive interpersonal skills and ability to function in a fast-paced environment',
     u'<b>Preferred Qualifications:</b>',
     u'*Three years RN experience in Outpatient Surgery along with some ICU experience.',
     u'*PALS',
     u'*PACU,Endoscopy or Ambulatory setting',
     u'*IV Conscious Sedation',
     u'<b>Hours/Schedule:</b>',
     u'*Variable',
     u'J2WPeriop',
     u'<br>']

Select only the text nodes (you lose the bold lines):

    In [2]: sel.xpath(""".//text()[preceding-sibling::*="Qualifications/Duties"]
       ...:                        [following-sibling::*="Entity Information"]""").extract()
    Out[2]:
    [u'Texas Health Presbyterian Allen is currently in search of a Registered Nurse to help meet the growing needs of our Day Surgery Department to work PRN in Day Surgery and also float to PACU.',
     u'J2WPeriop']

Select the sibling text nodes of the section you want, plus the text nodes inside the bold lines:

    In [3]: sel.xpath(""".//*[preceding-sibling::*="Qualifications/Duties"]
       ...:                    [following-sibling::*="Entity Information"]/text()
       ...:              | .//text()[preceding-sibling::*="Qualifications/Duties"]
       ...:                         [following-sibling::*="Entity Information"]""").extract()
    Out[3]:
    [u'Texas Health Presbyterian Allen is currently in search of a Registered Nurse to help meet the growing needs of our Day Surgery Department to work PRN in Day Surgery and also float to PACU.',
     u'Basic Qualifications:',
     u'Preferred Qualifications:',
     u'Hours/Schedule:',
     u'J2WPeriop']

(Editor: 李大同)
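To try the "between two siblings" pattern without running a spider, here is a minimal sketch using lxml directly (Scrapy's selectors are built on lxml, so the same expression should be usable with a selector's xpath() method). The SAMPLE markup, the helper names is_valid_xpath and text_between, and the job text are all made up for illustration; only the two marker strings and the XPath expressions come from the question and answer above.

```python
from lxml import etree, html

# Hypothetical markup, loosely modelled on the page described in the question.
SAMPLE = """
<div itemprop="description">
  <span>Qualifications/Duties</span>
  <br>
  Registered Nurse needed for Day Surgery.
  <b>Basic Qualifications:</b>
  *Valid RN license
  <span>Entity Information</span>
  Some hospital description.
</div>
"""

# The expression from the question, kept verbatim to show why it is rejected.
BROKEN = ('.//span/text()[following-sibling::*[text()="Qualifications/Duties" '
          'and preceding-sibling::*text()="Entity Information"]')

# Third option from the answer: sibling text nodes plus the text of sibling
# elements, both limited to nodes that come after the first marker and
# before the second one.
BETWEEN = (
    './/text()[preceding-sibling::*="Qualifications/Duties"]'
    '[following-sibling::*="Entity Information"]'
    ' | '
    './/*[preceding-sibling::*="Qualifications/Duties"]'
    '[following-sibling::*="Entity Information"]/text()'
)


def is_valid_xpath(expression):
    """Return True if lxml can compile the expression, False otherwise."""
    try:
        etree.XPath(expression)
    except etree.XPathError:
        return False
    return True


def text_between(fragment):
    """Return the non-empty, stripped text found between the two markers."""
    root = html.fromstring(fragment.strip())
    return [t.strip() for t in root.xpath(BETWEEN) if t.strip()]


print(is_valid_xpath(BROKEN))   # the question's expression does not compile
print(is_valid_xpath(BETWEEN))
duties = text_between(SAMPLE)
print(duties)
```

On the sample markup this picks up the sentence, the bold heading text and the bullet line, and leaves out everything after "Entity Information". Stripping and dropping whitespace-only strings is a choice made in this sketch; the answer's expressions return the raw nodes.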