import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # demo 1
    city = scrapy.Field()
    date = scrapy.Field()
    dayDesc = scrapy.Field()
    dayTemp = scrapy.Field()</code></pre>
Writing the spider
import scrapy
from weather.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://weather.sina.com.cn']

    def parse(self, response):
        item = WeatherItem()
        item['city'] = response.xpath('//*[@id="slider_ct_name"]/text()').extract()
        tenDay = response.xpath('//*[@id="blk_fc_c0_scroll"]')
        item['date'] = tenDay.css('p.wt_fc_c0_i_date::text').extract()
        item['dayDesc'] = tenDay.css('img.icons0_wt::attr(title)').extract()
        item['dayTemp'] = tenDay.css('p.wt_fc_c0_i_temp::text').extract()
        return item
Running the spider
scrapy crawl myweather -o wea.json
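The -o wea.json option writes the scraped items to a file as a JSON array. As a rough sketch of how that file could be read back afterwards, the snippet below loads such a feed with the standard json module. The helper name load_items and the sample record are hypothetical stand-ins, not output from a real crawl.

```python
import json
import os
import tempfile


def load_items(path):
    """Return the list of item dicts stored in a JSON feed file (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Hypothetical stand-in for the file the crawl would produce; values are made up.
sample = [{"city": ["北京"], "date": ["01-01"],
           "dayDesc": ["night-a", "day-a"], "dayTemp": ["5/-3"]}]
sample_path = os.path.join(tempfile.mkdtemp(), 'wea.json')
with open(sample_path, 'w', encoding='utf-8') as f:
    json.dump(sample, f)  # default ensure_ascii=True stores non-ASCII as \uXXXX

items = load_items(sample_path)
first_city = items[0]['city'][0]  # json.load turns the escapes back into text
print(len(items))
```

Each field holds a list because the spider stores the result of extract(), which is why the city is read with an extra [0].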
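Saving the data
To save the data to a file or a database, we need an Item Pipeline. So what is an Item Pipeline?
After an Item has been collected by the Spider, it is passed to the Item Pipeline, where a series of components process it in a defined order.
Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and not processed any further.
Typical uses of an item pipeline are:
Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Saving the scraped results to a file or database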
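To make the flow described above concrete, here is a minimal sketch of items passing through two components in order, with some of them being dropped. It runs without Scrapy: DropItem is a local stand-in for scrapy.exceptions.DropItem, run_pipelines is a simplified imitation of what the framework does, and ValidatePipeline and DuplicatesPipeline are hypothetical components that are not part of this project.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class ValidatePipeline:
    """Hypothetical component: drops items that lack a required field."""
    required = ('city', 'date')

    def process_item(self, item, spider):
        for field in self.required:
            if not item.get(field):
                raise DropItem('missing field: ' + field)
        return item


class DuplicatesPipeline:
    """Hypothetical component: drops items whose city was already seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = tuple(item['city'])
        if key in self.seen:
            raise DropItem('duplicate city')
        self.seen.add(key)
        return item


def run_pipelines(items, pipelines, spider=None):
    """Simplified imitation of the framework: pass each item through every component in order."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue  # a dropped item never reaches the later components
        kept.append(item)
    return kept


items = [
    {'city': ['Beijing'], 'date': ['01-01']},
    {'city': ['Beijing'], 'date': ['01-02']},  # duplicate city, dropped
    {'city': ['Shanghai'], 'date': []},        # empty date, dropped
]
kept = run_pipelines(items, [ValidatePipeline(), DuplicatesPipeline()])
print(len(kept))
```

Returning the item lets the next component receive it, while raising the exception ends processing for that item.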
# -*- coding: utf-8 -*-


class WeatherPipeline(object):
    def __init__(self):
        pass

    def process_item(self, item, spider):
        # Assumes Python 3: an explicit encoding writes non-ASCII text as UTF-8
        # regardless of the system default encoding.
        with open('wea.txt', 'w+', encoding='utf-8') as file:
            city = item['city'][0]
            file.write('city:' + city + '\n\n')

            date = item['date']
            desc = item['dayDesc']
            # The descriptions alternate, so slice them into day and night entries.
            dayDesc = desc[1::2]
            nightDesc = desc[0::2]
            dayTemp = item['dayTemp']

            # Separate loop variables keep the original item from being overwritten,
            # so the method still returns the item itself.
            for d, dd, nd, temp in zip(date, dayDesc, nightDesc, dayTemp):
                # Each temperature string has the form "day/night".
                ta = temp.split('/')
                dt = ta[0]
                nt = ta[1]
                txt = 'date:{0}\t\tday:{1}({2})\t\tnight:{3}({4})\n\n'.format(
                    d, dd, dt, nd, nt
                )
                file.write(txt)
        return item</code></pre>
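The slicing and zipping inside process_item can be easier to follow when pulled out into a plain function. The sketch below is one possible way to do that. The function name format_forecast and the sample values are hypothetical, and it uses str.partition instead of split so that a temperature string without a '/' produces an empty night value rather than an IndexError.

```python
def format_forecast(dates, descs, temps):
    """Build one text line per forecast day, following the pipeline's layout."""
    day_descs = descs[1::2]    # odd positions, as in the pipeline
    night_descs = descs[0::2]  # even positions, as in the pipeline
    lines = []
    for d, dd, nd, temp in zip(dates, day_descs, night_descs, temps):
        # partition returns ('7', '', '') when there is no '/' in the string
        dt, _, nt = temp.partition('/')
        lines.append('date:{0}\t\tday:{1}({2})\t\tnight:{3}({4})'.format(d, dd, dt, nd, nt))
    return lines


# Hypothetical sample values standing in for the scraped fields.
dates = ['01-01', '01-02']
descs = ['night-a', 'day-a', 'night-b', 'day-b']
temps = ['5/-3', '7']

lines = format_forecast(dates, descs, temps)
for line in lines:
    print(line)
```

Because zip stops at the shortest input, a missing description or temperature shortens the output instead of raising an error.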
Add ITEM_PIPELINES to the settings
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
}
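The integer assigned to each entry determines the order in which the components run: lower values run first, and values are conventionally chosen in the 0 to 1000 range. As an illustration only, the sketch below shows how several entries would be ordered. WeatherPipeline is the only class from this project; the other two names and all three values are hypothetical.

```python
# Hypothetical settings: only WeatherPipeline exists in this project.
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.ValidatePipeline': 100,    # hypothetical component
    'weather.pipelines.DuplicatesPipeline': 200,  # hypothetical component
}

# Components run in ascending order of their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

Giving the component that writes the file the highest value means it only sees items that the earlier components let through.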
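Postscript
When the spider saved its output on Windows, the file was written in the system's default encoding, which could not represent some of the output characters, so an error was raised.
The fix is to write the file with an explicit UTF-8 encoding, using the codecs library, and to call
json.dumps(obj, indent=2, ensure_ascii=False)
to output JSON containing Chinese text, where obj is the data being serialized. With ensure_ascii=False the text is written in its original script instead of as \uXXXX escapes. The indent parameter sets the number of spaces per indentation level.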
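The two pieces of this fix can be tried out together in a few lines. This is a minimal sketch under assumed data: the record and the temporary output path are hypothetical, while codecs.open and the json.dumps parameters are the standard library calls named above.

```python
import codecs
import json
import os
import tempfile

record = {'city': '北京', 'dayTemp': ['5/-3']}  # hypothetical sample values

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record, indent=2)
# ensure_ascii=False keeps the original characters in the output string.
readable = json.dumps(record, indent=2, ensure_ascii=False)

out_path = os.path.join(tempfile.mkdtemp(), 'wea.json')
# codecs.open encodes the text as UTF-8 on write, independent of the system default.
with codecs.open(out_path, 'w', encoding='utf-8') as f:
    f.write(readable)

# Reading back with the same encoding restores the original data.
with codecs.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = json.load(f)

print(round_trip == record)
```

On Python 3 the built-in open(path, 'w', encoding='utf-8') has the same effect as codecs.open here.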
(Editor: 李大同)