python – Running multiple spiders in a for loop
I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me a ReactorNotRestartable error.
feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}

With the items above, I try to run the two spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = None

    def __init__(self, **kwargs):
        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()

On the second loop iteration the spider is opened, but then it fails like this:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start()  # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to invalidate the first MySpider, or is there something else I am doing wrong and need to change for this to work? Thanks in advance.

Solution
It looks like you have to instantiate one process per spider; try:
def start_crawler():
    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
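For reference, the underlying problem is that a Twisted reactor cannot be started again once it has stopped, so any pattern that calls process.start() more than once in the same Python process can hit this error. The approach described in Scrapy's documentation for running several spiders in one process is to schedule every crawl first and call start() exactly once. Below is a minimal sketch of that idea, assuming the question's feeds and CONFIG dictionaries, and reordering MySpider.__init__ so the base Spider class sets self.name from a constructor keyword argument before feeds is consulted:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    def __init__(self, name=None, **kwargs):
        # Let the base Spider set self.name from the kwarg first,
        # then look up the feed configuration for that name.
        super(MySpider, self).__init__(name=name, **kwargs)
        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'

    def parse_node(self, response, node):
        pass

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    for feed_name in feeds.keys():
        # Extra keyword arguments to crawl() are forwarded to the spider's __init__.
        process.crawl(MySpider, name=feed_name)
    process.start()  # one blocking reactor run covers all scheduled crawls

If the crawls must run strictly one after another instead of concurrently, the documented alternatives are CrawlerRunner with chained deferreds, or launching each crawl in a separate subprocess so that each gets a fresh reactor.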