python – Scrapy crawls only one level of the website
I am using Scrapy to crawl all the web pages under a domain.
I have looked at this question, but there was no solution there, and my problem seems similar. The output of my crawl command looks like this:

scrapy crawl sjsu

2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines:
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232, 'memusage/startup': 29663232}

The problem here is that the crawl finds the links on the first page but does not visit them. What is the use of a crawler like this?

Edit:

My crawler code is:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(BaseSpider):

    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)

All my other settings are the defaults.
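For reference, the spider stops after one page because parse() never yields any follow-up requests, so the scheduler has nothing left to fetch. Below is a minimal sketch (not part of the original question; the class name SjsuFollowSpider is made up for illustration) of what a plain BaseSpider would have to do to follow links itself, using the Scrapy 0.14 APIs BaseSpider, HtmlXPathSelector and Request:

from urlparse import urljoin

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class SjsuFollowSpider(BaseSpider):
    # hypothetical spider name, used only for this sketch
    name = "sjsu_follow"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # save the page, as the original spider does
        open("sjsupages", 'ab').write(response.body)

        # extract every href and schedule it; OffsiteMiddleware drops
        # requests that fall outside allowed_domains
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse)

The CrawlSpider suggested in the answer below does essentially this, driven by its rules.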
Solution

I think the best approach here is to use a CrawlSpider. You have to modify your code as follows, so that the spider finds all the links on the first page and visits them:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(CrawlSpider):

    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']

    # allow=() is used to match all links
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)

        filename = "sjsupages"
        # open the file in append mode, since every crawled page is written to it
        open(filename, 'ab').write(response.body)

If you want to crawl all the links in the website (and not only those on the first level), the rule also has to follow the links it extracts, so change the rules variable to:

    # follow=True makes the spider keep extracting links from every page it
    # visits, while the callback still writes each page out
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)]

I have changed the 'parse' callback to 'parse_item' because CrawlSpider uses the parse method internally to implement its rule logic, so overriding parse would stop the rules from working.
For more information, see: http://doc.scrapy.org/en/0.14/topics/spiders.html#crawlspider
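One more thing worth checking, since the log above shows DepthMiddleware enabled: crawl depth is controlled by the DEPTH_LIMIT setting. This is only a sketch of the relevant line in settings.py; the default value is 0, which means unlimited depth, so the default settings are not what limits the crawl to one level:

# settings.py (sketch)
# 0 (the default) means no depth limit; set e.g. 2 to stop the crawl
# two levels below the start_urls
DEPTH_LIMIT = 0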