python – How to fetch image files with Scrapy
Published: 2020-12-20 13:35:17 | Category: Python | Source: compiled from the web
I've just started using Scrapy and I'm trying to crawl image files. Here is my code.
items.py:

```python
from scrapy.item import Item, Field

class TutorialItem(Item):
    image_urls = Field()
    images = Field()
```

settings.py:

```python
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/Users/rnd/Desktop/Scrapy-0.16.5/tutorial/images'
```

pipelines.py:

```python
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
```

tutorial_spider.py:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import TutorialItem

class TutorialSpider(BaseSpider):
    name = "tutorial"
    allowed_domains = ["roxie.com"]
    start_urls = ["http://www.roxie.com/events/details.cfm?eventid=581D228B%2DB338%2DF449%2DBD69027D7D878A7F"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        link = hxs.select('//div[@id="eventdescription"]//img/@src').extract()
        item['image_urls'] = ["http://www.roxie.com" + link]
        return item
```

Log printed by the command `scrapy crawl tutorial -o roxie.json -t json`:

```
2013-06-19 17:29:06-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/web/microdom.py:181: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert (oldChild.parentNode is self,
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.16.5', 'scrapy')
  File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 489, in run_script
  File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 1207, in run_script
    # we assume here that our metadata may be nested inside a "basket"
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/command.py", line 33, in crawler
    self._crawler.configure()
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/crawler.py", line 41, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/engine.py", line 63, in __init__
    self.scraper = Scraper(crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/scraper.py", line 66, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 50, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 29, in from_settings
    mwcls = load_object(clspath)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/utils/misc.py", line 39, in load_object
    raise ImportError, "Error loading object '%s': %s" % (path, e)
ImportError: Error loading object 'scrapy.contrib.pipeline.images.ImagesPipeline': No module named PIL
```

It looked like PIL was required, so I installed it:

```
PIL 1.1.7 is already the active version in easy-install.pth
Installing pilconvert.py script to /usr/local/bin
Installing pildriver.py script to /usr/local/bin
Installing pilfile.py script to /usr/local/bin
Installing pilfont.py script to /usr/local/bin
Installing pilprint.py script to /usr/local/bin
Using /Library/Python/2.6/site-packages/PIL-1.1.7-py2.6-macosx-10.6-universal.egg
Processing dependencies for pil
Finished processing dependencies for pil
```

However, it still doesn't work. Could you let me know what I'm missing? Thanks in advance!

Solution
Yes – I ran into the same problem when I started scraping images from some sites. I was working on CentOS 6.5 with Python 2.7.6, and I solved it as follows:
```
yum install easy_install
easy_install pip
```

Then log in as root and run `pip install image`. After that, everything worked.

If you are working on Ubuntu, I think the equivalent first step is `sudo apt-get install easy_install`, and the rest should be the same.

I hope it helps.
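A side note on the spider code in the question: `hxs.select(...).extract()` returns a *list* of strings, so `"http://www.roxie.com" + link` would raise a `TypeError` even once the PIL import is fixed. A minimal sketch of building absolute image URLs from extracted `src` values using the standard library's `urljoin` (the helper name `absolutize` and the example `src` values are made up for illustration, not taken from the real page):

```python
from urllib.parse import urljoin

def absolutize(base_url, srcs):
    """Join each extracted src attribute against the page's URL.

    urljoin handles root-relative paths ("/x.jpg"), page-relative
    paths ("x.jpg"), and already-absolute URLs correctly, which a
    plain string concatenation does not.
    """
    return [urljoin(base_url, src) for src in srcs]

links = absolutize("http://www.roxie.com/events/details.cfm",
                   ["/images/poster.jpg", "banner.png"])
print(links)
# → ['http://www.roxie.com/images/poster.jpg',
#    'http://www.roxie.com/events/banner.png']
```

With this, the spider's assignment would become `item['image_urls'] = absolutize(response.url, link)` instead of concatenating the domain onto the list.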