python – scrapy_redis: stop my spiders after X idle time
I have a pool of scrapy_redis spiders that listen to a redis queue (the number of spiders is not always the same). The queue is fed by another script. I want my spiders to stop after X minutes of inactivity, once there is nothing left in the redis queue.
I set SCHEDULER_IDLE_BEFORE_CLOSE in settings.py, but it does not seem to work.

Here is my settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_IDLE_BEFORE_CLOSE = 10

REDIS_HOST = 'localhost'

DOWNLOADER_MIDDLEWARES = {
    'serp_crawl.middlewares.RandomUserAgentMiddleware': 200,
    'scrapy_crawlera.CrawleraMiddleware': 300
}

CRAWLERA_ENABLED = True
CRAWLERA_USER = ''
CRAWLERA_PASS = ''

# Activate Crawlera User Agent
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-UA": "pass",
}

UPDATE

Here is my spider code:

from scrapy_redis.spiders import RedisSpider
from elasticsearch import Elasticsearch
from serp_crawl.settings import *
from datetime import datetime
from redis import Redis
import scrapy
import json


class SerpSpider(RedisSpider):
    name = "serpcrawler"
    redis_key = 'serp_crawler:request'

    def __init__(self, redis_host='localhost', redis_port='6379',
                 elasticsearch_host='localhost', elasticsearch_port='9200',
                 mysql_host='localhost', dev=False):
        super(SerpSpider, self).__init__()
        self.platform = None
        self.dev = bool(dev)
        self.q = Redis(redis_host, redis_port)
        self.es = Elasticsearch([{'host': elasticsearch_host,
                                  'port': elasticsearch_port}])

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        crawler.settings.attributes['REDIS_HOST'].value = kwargs['redis_host']
        obj = super(RedisSpider, self).from_crawler(crawler, **kwargs)
        obj.setup_redis(crawler)
        return obj

    def make_requests_from_url(self, url):
        data = json.loads(url)
        self.logger.info('Got new url to parse: ', data['url'])
        self.settings.attributes['DEFAULT_REQUEST_HEADERS'].value \
            .attributes['X-Crawlera-UA'].value = data['platform']
        self.platform = data['platform']
        return scrapy.Request(url=data['url'], callback=self.parse,
                              meta={'keyword': data['keyword'],
                                    'id': data['id_keyword'],
                                    'country': data['country'],
                                    'platform': data['platform']},
                              dont_filter=True)

    def parse(self, response):
        doc = dict()
        try:
            doc['content'] = response.body.decode('cp1252')
        except:
            doc['content'] = response.body
        doc['date'] = datetime.now().strftime('%Y-%m-%d')
        doc['keyword'] = str(response.meta['keyword'])
        doc['type_platform'] = str(response.meta['platform'])
        doc['country'] = str(response.meta['country'])
        if not self.dev:
            id_index = self.es.index(index='serp_html', doc_type='page', body=doc)
            self.q.lpush('batching_serp',
                         {'id_index': str(id_index['_id']),
                          'type_batching': 'default',
                          'country': doc['country'],
                          'type_platform': doc['type_platform'],
                          'keyword': doc['keyword'],
                          'id_keyword': int(response.meta['id'])})
            self.logger.info('Indexed new page. id_es : [' + str(id_index['_id']) + ']')

Thanks for your help.

Solution
The scrapy-redis documentation says:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10

So you need to set one of the following two settings:

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# or
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
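Putting the two settings together, a minimal settings.py for this case might look like the following sketch (an illustration, not the poster's full configuration: the Crawlera settings are omitted, and the 600-second value is just an example standing in for "ten minutes of inactivity"):

# Sketch: scheduler settings that let SCHEDULER_IDLE_BEFORE_CLOSE take effect.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# A queue class that supports the idle timeout; FifoQueue gives FIFO
# ordering, LifoQueue would give stack (LIFO) ordering instead.
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'

# Seconds the scheduler blocks on an empty redis queue before giving up,
# so "X minutes" translates to X * 60, e.g. 600 for ten minutes.
SCHEDULER_IDLE_BEFORE_CLOSE = 600

REDIS_HOST = 'localhost'
REDIS_PORT = 6379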
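One caveat: RedisSpider keeps itself alive by raising DontCloseSpider from its spider_idle handler, so depending on the scrapy-redis version the queue-class change alone may not shut the spiders down. A common workaround is a small Scrapy extension that watches the idle signal itself. The sketch below uses only standard Scrapy signals; the CloseOnIdle class and the IDLE_CLOSE_SECONDS setting are made-up names for the example, not part of scrapy-redis:

# extensions.py - hypothetical idle-timeout extension (a sketch, not
# the solution given in the answer above).
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseOnIdle(object):
    """Force-close the spider once the engine has been idle too long.

    spider_idle fires repeatedly (roughly every 5 seconds) while the
    engine has nothing to do, so we timestamp the first idle signal
    and close the spider once the configured limit has elapsed.
    """

    def __init__(self, crawler, max_idle_seconds):
        self.crawler = crawler
        self.max_idle = max_idle_seconds
        self.idle_since = None
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
        crawler.signals.connect(self.request_scheduled,
                                signal=signals.request_scheduled)

    @classmethod
    def from_crawler(cls, crawler):
        seconds = crawler.settings.getint('IDLE_CLOSE_SECONDS', 0)
        if seconds <= 0:
            raise NotConfigured
        return cls(crawler, seconds)

    def request_scheduled(self, request, spider):
        # New work arrived from the redis queue: reset the idle clock.
        self.idle_since = None

    def spider_idle(self, spider):
        now = time.time()
        if self.idle_since is None:
            self.idle_since = now
        elif now - self.idle_since >= self.max_idle:
            self.crawler.engine.close_spider(spider, reason='idle_timeout')

Enable it in settings.py with something like EXTENSIONS = {'serp_crawl.extensions.CloseOnIdle': 500} and IDLE_CLOSE_SECONDS = 600. Newer scrapy-redis releases also added a MAX_IDLE_TIME_BEFORE_CLOSE setting aimed at exactly this scenario, so check the documentation of the version you are running.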