python – Passing arguments to a Scrapy callback so they can be received later
I am trying to get this spider to work. It works when each of its components is asked to scrape on its own, but as soon as I try to use a Scrapy callback to receive arguments later on, it crashes. The goal is to crawl across multiple pages and scrape data, writing it to the output JSON file in the format:
author | album | title | lyrics

Each piece of data lives on a different web page, which is why I want to use Scrapy callbacks to pass values between them.

Each of the items above is defined in Scrapy's items.py as:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    album = scrapy.Field()
    title = scrapy.Field()
    lyrics = scrapy.Field()

The spider code starts here:

import scrapy
import re
import json
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

# urls class
class DomainSpider(scrapy.Spider):
    name = "domainspider"
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com']
    rules = (
        Rule(LinkExtractor(allow='www.domain.com/[A-Z][a-zA-Z_/]+$'), 'parse', follow=True),
    )

    # Parsing starts here
    # crawling and scraping the links from the menu list
    def parse(self, response):
        links = response.xpath('//html/body/nav[1]/div/ul/li/div/a/@href')
        for link in links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_artist_page)

    # crawling and scraping artist names and links
    def parse_artist_page(self, response):
        artist_links = response.xpath('//*/div[contains(@class,"artist-col")]/a/@href')
        author = response.xpath('//*/div[contains(@class,"artist-col")]/a/text()').extract()
        item = TutorialItem(author=author)
        for link in artist_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_album_page)
                request.meta['author'] = item
        yield item
        return

    # crawling and scraping album names and links
    def parse_album_page(self, response):
        album_links = response.xpath('//*/div[contains(@id,"listAlbum")]/a/@href')
        album = response.xpath('//*/div[contains(@class,"album")]/b/text()').extract()
        item = TutorialItem(album=album)
        for link in album_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_lyrics_page)
                request.meta['album'] = item
        yield item
        return

    # crawling and scraping titles and lyrics
    def parse_lyrics_page(self, response):
        title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
        lyrics = map(unicode.strip, response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract())
        item = response.meta['author', 'album']
        item = TutorialItem(author=author, album=album, title=title, lyrics=lyrics)
        yield item

The code crashes when it reaches the callback lines:

request.meta['author'] = item
yield item
return

Can anyone help?
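The crash itself is easy to localize: inside each loop the request is created with a bare yield scrapy.Request(...) and never bound to a name, so the following line request.meta['author'] = item raises a NameError. A minimal sketch of the usual pattern (not from the original post; it reuses the question's own names, and meta is Scrapy's standard per-request dict, which the Request constructor accepts directly):

import scrapy

# method of the DomainSpider class above; replaces the crashing loop body
def parse_artist_page(self, response):
    author = response.xpath('//*/div[contains(@class,"artist-col")]/a/text()').extract()
    for link in response.xpath('//*/div[contains(@class,"artist-col")]/a/@href').extract():
        # attach the data to the request itself; the target callback
        # reads it back later from response.meta
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_album_page,
                             meta={'author': author})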
Solution

I did find what the problem was: the way I had set up the callback requests. This version now works:
# crawling and scraping artist names and links
def parse_artist_page(self, response):
    artist_links = response.xpath('//*/div[contains(@class,"artist-col")]/a/@href')
    author = response.xpath('//*/div[contains(@class,"artist-col")]/a/text()').extract()
    for link in artist_links:
        next_page_link = link.extract()
        if next_page_link:
            next_page = response.urljoin(next_page_link)
            request = scrapy.Request(next_page, callback=self.parse_album_page)
            request.meta['author'] = author
            return request

# crawling and scraping album names and links
def parse_album_page(self, response):
    author = response.meta.get('author')
    album_links = response.xpath('//*/div[contains(@id,"listAlbum")]/a/@href')
    album = response.xpath('//*/div[contains(@class,"album")]/b/text()').extract()
    for link in album_links:
        next_page_link = link.extract()
        if next_page_link:
            next_page = response.urljoin(next_page_link)
            request = scrapy.Request(next_page, callback=self.parse_lyrics_page)
            request.meta['author'] = author
            request.meta['album'] = album
            return request

# crawling and scraping song titles and lyrics
def parse_lyrics_page(self, response):
    author = response.meta.get('author')
    album = response.meta.get('album')
    title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
    lyrics = map(unicode.strip, response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract())
    item = TutorialItem(author=author, album=album, title=title, lyrics=lyrics)
    yield item
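One caveat about the accepted version: return request fires on the first loop iteration, so each page only ever follows a single artist or album link. A variant sketch (not part of the original answer; it assumes the same spider class and XPaths) that yields every request and passes meta through the constructor would fan the crawl out fully:

# drop-in variant of parse_album_page from the fix above
def parse_album_page(self, response):
    author = response.meta.get('author')
    album = response.xpath('//*/div[contains(@class,"album")]/b/text()').extract()
    for link in response.xpath('//*/div[contains(@id,"listAlbum")]/a/@href').extract():
        # one request per album link instead of returning after the first
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_lyrics_page,
                             meta={'author': author, 'album': album})

Note also that map(unicode.strip, ...) only works on Python 2; on Python 3 the same cleanup would be a list comprehension such as [s.strip() for s in response.xpath(...).extract()], since the unicode type no longer exists.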