The Scrapy Framework

Published: 2020-12-20 10:21:52 | Category: Python | Source: compiled from the web


How Scrapy works:

Spiders send requests ==> engine ==> scheduler ==> downloader (fetches the response) ==> spiders ==> data processing (item, pipeline).
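
To make the flow above concrete, here is a toy model in plain Python. It is not Scrapy's actual implementation: the names `fake_download`, `parse`, `run_engine`, and the example URLs are invented for illustration, and only the order of the hand-offs mirrors the description.

```python
from collections import deque

# Hypothetical stand-in for the downloader: returns a fake "response" dict.
def fake_download(url):
    return {"url": url, "body": f"<html>page for {url}</html>"}

# Hypothetical spider callback: yields items (dicts) and follow-up URLs (str).
def parse(response):
    yield {"source": response["url"], "length": len(response["body"])}
    if response["url"].endswith("/page/1"):
        yield "http://example.com/page/2"

def run_engine(start_urls):
    scheduler = deque(start_urls)   # the scheduler is a queue of requests
    seen = set(start_urls)
    collected = []                  # stands in for the item pipeline
    while scheduler:
        url = scheduler.popleft()           # engine takes the next request
        response = fake_download(url)       # downloader fetches it
        for result in parse(response):      # response goes back to the spider
            if isinstance(result, dict):
                collected.append(result)    # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)    # new requests are queued again
    return collected

items = run_engine(["http://example.com/page/1"])
```

The loop ends when the scheduler queue is empty, which is also roughly when a real crawl finishes.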

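Create the project (scrapy startproject xxx): create a new crawler project.

Define the target (write items.py): declare the fields you want to scrape.

Build the spider (spiders/xxspider.py): write the spider that crawls the data.

Store the results (pipelines.py): design a pipeline that stores the scraped content.

Running the crawler project:

From the command line: scrapy crawl myspider

From PyCharm, run a script containing:

from scrapy import cmdline
cmdline.execute('scrapy crawl myspider'.split(" "))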

Pipelines:

First, in settings.py:

ITEM_PIPELINES = {
#     'mySpider.pipelines.mySpiderPipelines': 100,
    # The number sets the order: pipelines with lower values run first.
    'mySpider.pipelines.MyspiderPipeline': 300,
}

Then, in pipelines.py:

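import json

class MyspiderPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.filename = open('teacher.json', 'w', encoding='utf8')

    # Handle each item: write it as one JSON object per line
    def process_item(self, item, spider):
        jsontxt = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontxt)
        # Return the item so that any later pipeline also receives it
        return item

    # Called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()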
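
As a variation on the pipeline above, the file can be opened in `open_spider` rather than `__init__`, so that it is created only when a crawl actually starts. `open_spider` and `close_spider` are hooks that Scrapy calls on pipelines; the class name `JsonLinesPipeline`, the `path` argument, the sample items, and the manual driver at the bottom are assumptions made so that the sketch runs without Scrapy.

```python
import json
import os
import tempfile

class JsonLinesPipeline:
    """Writes each item as one JSON object per line (hypothetical name)."""

    def __init__(self, path="teacher.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

# Manual driver, standing in for what Scrapy does during a crawl.
out_path = os.path.join(tempfile.mkdtemp(), "teacher.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "张老师", "title": "讲师"}, spider=None)
pipeline.process_item({"name": "Li", "title": "lecturer"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf8") as f:
    lines = f.read().splitlines()
```

Each line of the resulting file can be parsed on its own with `json.loads`, which makes the output easy to process line by line.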

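Following the next page with a callback (in myspider.py, placed outside the for loop):

# Send a new request back to the scheduler's queue; the downloader then fetches it

yield scrapy.Request(self.url + str(self.offset), callback=self.parse)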
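
The paging step relies on the spider keeping a counter and building the next URL from it. A minimal sketch of that logic in plain Python follows. The base URL, the step of 10, and the upper limit of 30 are assumed values; in a real spider each URL would be wrapped in `scrapy.Request(url, callback=self.parse)` and yielded from `parse`.

```python
def next_page_urls(base_url, start=0, step=10, last=30):
    """Yield successive page URLs by increasing the offset (assumed values)."""
    offset = start
    while offset < last:
        offset += step
        yield base_url + str(offset)

urls = list(next_page_urls("http://example.com/list?start="))
```

Stopping at an upper limit matters: without such a check the spider would keep requesting pages that do not exist.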

Setting request headers (in settings.py):

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

Setting the download delay (in settings.py):

#DOWNLOAD_DELAY = 3
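
The line above is commented out, so it has no effect until the leading `#` is removed. A sketch of the relevant settings.py lines once enabled: 3 is the value from the commented default, and `RANDOMIZE_DOWNLOAD_DELAY` is shown only to make Scrapy's default behaviour explicit.

```python
# settings.py
DOWNLOAD_DELAY = 3  # seconds to wait between requests to the same site

# Enabled by default: the actual wait is a random value between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```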

Enabling pipelines (in settings.py; uncomment the entry and point it at your pipeline class):

ITEM_PIPELINES = {
#     'mySpider.pipelines.mySpiderPipelines': 100,
}

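Pipeline for text:

import json

class MyspiderPipeline(object):
    def __init__(self):
        self.filename = open('teacher.json', 'w', encoding='utf8')

    def process_item(self, item, spider):
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()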

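Pipeline for images:

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Note: this class reuses the name of the base class imported above
class ImagesPipeline(ImagesPipeline):
    # def process_item(self, item, spider):
    #     return item

    # Read the value of IMAGES_STORE from settings.py
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    # Yield one download request for the item's image URL
    def get_media_requests(self, item, info):
        image_url = item["imagelink"]
        yield scrapy.Request(image_url)

    # Called when the item's downloads have finished
    def item_completed(self, results, item, info):
        # Keep only the paths of images that downloaded successfully
        image_path = [x["path"] for ok, x in results if ok]
        # Rename the downloaded file after the item's nickname
        os.rename(self.IMAGES_STORE + "/" + image_path[0],
                  self.IMAGES_STORE + "/" + item["nickname"] + ".jpg")
        item["imagePath"] = self.IMAGES_STORE + "/" + item["nickname"]
        return item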
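
The `item_completed` step depends on the shape of its results argument: a list of `(success, info)` pairs, where `info` is a dict containing a `path` relative to `IMAGES_STORE` for each successful download. The sketch below reproduces only the filter-and-rename logic on made-up data in a temporary directory. The `rename_first_image` helper, the file names, and the sample results are hypothetical.

```python
import os
import tempfile

def rename_first_image(store, results, nickname):
    """Rename the first successfully downloaded image to <nickname>.jpg."""
    paths = [info["path"] for ok, info in results if ok]
    if not paths:
        return None  # nothing was downloaded; leave the store untouched
    target = os.path.join(store, nickname + ".jpg")
    os.rename(os.path.join(store, paths[0]), target)
    return target

# Fake store containing one "downloaded" file.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "full"))
with open(os.path.join(store, "full", "abc123.jpg"), "wb") as f:
    f.write(b"not a real image")

# Made-up results: one failure, one success.
results = [
    (False, Exception("download failed")),
    (True, {"url": "http://example.com/a.jpg", "path": "full/abc123.jpg"}),
]
new_path = rename_first_image(store, results, "alice")
```

Checking for an empty list first avoids the `IndexError` that indexing `image_path[0]` raises when every download for an item has failed.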

(Editor: 李大同)
