python – Passing arguments to a callback function with Scrapy so they can be received later

Published: 2020-12-20 11:56:18 Category: Python Source: compiled from the web


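I am trying to get this spider to work. When each of its components is asked to scrape separately, it works, but as soon as I try to use Scrapy callback functions to pass arguments along so they can be received later, it crashes. The goal is to crawl multiple pages and scrape the data, writing it to the output JSON file in the format:

author | album | title | lyrics

Each piece of data is located on a different web page, which is why I am trying to use Scrapy callback functions to accomplish this.

In addition, each of the above items is defined as a field in the Scrapy items.py: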

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    album = scrapy.Field()
    title = scrapy.Field()
    lyrics = scrapy.Field()

The spider code starts here:

import scrapy
import re
import json

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider,Rule
from tutorial.items import TutorialItem


# urls class
class DomainSpider(scrapy.Spider):
    name = "domainspider"
    allowed_domains = ['www.domain.com']
    start_urls = [
        'http://www.domain.com',]

    rules = (
        Rule(LinkExtractor(allow='www.domain.com/[A-Z][a-zA-Z_/]+$'),'parse',follow=True,),)

    # Parsing start here
    # crawling and scraping the links from menu list
    def parse(self,response):
        links = response.xpath('//html/body/nav[1]/div/ul/li/div/a/@href')

        for link in links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page,callback=self.parse_artist_page)

    # crawling and scraping artist names and links
    def parse_artist_page(self,response):
        artist_links = response.xpath('//*/div[contains(@class,"artist-col")]/a/@href')
        author = response.xpath('//*/div[contains(@class,"artist-col")]/a/text()').extract()

        item = TutorialItem(author=author)

        for link in artist_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page,callback=self.parse_album_page)

                request.meta['author'] = item
                yield item
                return

    # crawling and scraping album names and links
    def parse_album_page(self,response):
        album_links = response.xpath('//*/div[contains(@id,"listAlbum")]/a/@href')
        album = response.xpath('//*/div[contains(@class,"album")]/b/text()').extract()

        item = TutorialItem(album=album)

        for link in album_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page,callback=self.parse_lyrics_page)

                request.meta['album'] = item
                yield item
                return

    # crawling and scraping titles and lyrics
    def parse_lyrics_page(self,response):
        title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
        lyrics = map(unicode.strip,response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract())

        item = response.meta['author','album']
        item = TutorialItem(author=author,album=album,title=title,lyrics=lyrics)
        yield item

The code crashes at the point where it hands off to the callback function:

request.meta['author'] = item
yield item
return

Can anyone help?
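The failure at those three lines can be reproduced without Scrapy. In the question's callbacks the request is created inside the `yield` expression and never bound to a name, so the following line refers to a `request` variable that does not exist. The sketch below is only an illustration of that reading: `FakeRequest` is a hypothetical stand-in for `scrapy.Request`, and the links and author value are made up.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a meta dict."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


def broken_callback(links):
    # Mirrors the question: the request is yielded without being bound to a
    # name, so the next line uses a variable that was never defined.
    for link in links:
        yield FakeRequest(link)
        request.meta['author'] = 'someone'  # NameError once the generator resumes


def fixed_callback(links, author):
    # Bind the request first, attach the data, then hand it over.
    for link in links:
        request = FakeRequest(link)
        request.meta['author'] = author
        yield request


gen = broken_callback(['/a', '/b'])
first = next(gen)              # the first request comes out fine
try:
    next(gen)                  # resuming runs the line after the yield
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

fixed = list(fixed_callback(['/a', '/b'], 'someone'))

# meta behaves like an ordinary dict, so keys are read one at a time;
# meta['author', 'album'] would look up the tuple ('author', 'album') instead.
meta = {'author': 'someone', 'album': 'something'}
author, album = meta['author'], meta['album']
```

The same dict behaviour explains why `response.meta['author','album']` in `parse_lyrics_page` cannot return both values: each key has to be read separately.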

Solution

I did find the problem: it was the way I had set up the callback functions. This version now works:

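    # crawling and scraping artist names and links
    def parse_artist_page(self,response):
        artist_links = response.xpath('//*/div[contains(@class,"artist-col")]/a/@href')
        author = response.xpath('//*/div[contains(@class,"artist-col")]/a/text()').extract()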

        for link in artist_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                request = scrapy.Request(next_page,callback=self.parse_album_page)
                request.meta['author'] = author
                yield request  # yield, not return, so the loop continues past the first link

    # crawling and scraping album names and links
    def parse_album_page(self,response):
        author = response.meta.get('author')

        album_links = response.xpath('//*/div[contains(@id,"listAlbum")]/a/@href')
        album = response.xpath('//*/div[contains(@class,"album")]/b/text()').extract()

        for link in album_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                request = scrapy.Request(next_page,callback=self.parse_lyrics_page)
                request.meta['author'] = author
                request.meta['album'] = album
                yield request  # yield, not return, so the loop continues past the first link

    # crawling and scraping song titles and lyrics
    def parse_lyrics_page(self,response):
        author = response.meta.get('author')
        album = response.meta.get('album')

        title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
        lyrics = [line.strip() for line in response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract()]

        item = TutorialItem(author=author,album=album,title=title,lyrics=lyrics)
        yield item

(Editor: 李大同)
