pyspider Architecture -- Official Documentation (translated repost)
Original: http://docs.pyspider.org/en/latest/Architecture/

This document describes why I made pyspider, and its architecture.

Two years ago, I was working on a vertical search engine. We faced the following needs for crawling:
Furthermore, we have some APIs from our partners; these APIs may need POST requests, proxies, request signatures, etc. Full control from the script is more convenient than a set of global component parameters.

The following diagram shows an overview of the pyspider architecture, its components, and an outline of the data flow inside the system. Components are connected by message queues. Every component, including the message queues, runs in its own process/thread and is replaceable. That means that when processing is slow, you can run many instances of the processor to make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast.

The Scheduler receives tasks from the processor via the newtask_queue. It decides whether a task is new or requires a re-crawl, sorts tasks according to priority, and feeds them to the fetcher with traffic control (a token bucket algorithm). It takes care of periodic tasks, lost tasks, and failed tasks, and retries them later. All of the above can be set via the self.crawl API. Note that in the current implementation, only one scheduler instance is allowed.

The Fetcher is responsible for fetching web pages and sending the results to the processor. For flexibility, the fetcher supports data URIs and pages rendered by JavaScript (via PhantomJS). The fetch method, headers, cookies, etag, etc. can be controlled by the script via the self.crawl API.

The PhantomJS Fetcher works like a proxy. It is connected to the general Fetcher; it fetches and renders pages with JavaScript enabled, and returns plain HTML back to the Fetcher.
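The scheduler's traffic control mentioned above can be sketched with a simple token bucket. This is a standalone illustration of the algorithm under assumed names, not pyspider's actual scheduler code:

```python
import time

class TokenBucket:
    """Simple token bucket: allows bursts of up to `burst` requests,
    refilling at `rate` tokens per second."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per second
        self.burst = burst        # maximum bucket size
        self.tokens = burst       # start with a full bucket
        self.last = time.time()

    def consume(self, n=1):
        """Try to take n tokens; return True if the task may be dispatched."""
        now = time.time()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# A scheduler would call consume() before feeding each task to the fetcher:
bucket = TokenBucket(rate=2, burst=5)
dispatched = sum(1 for _ in range(10) if bucket.consume())
# only the initial burst of 5 tasks passes immediately
```

Tasks arriving faster than the refill rate simply wait in the queue until tokens become available, which is what bounds the request rate per project.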
The Processor is responsible for running the script written by the user to parse and extract information. Your script runs in an unrestricted environment. Although we provide various tools (like PyQuery) for extracting information and links, you can use anything you want to deal with the response. Refer to the scripting documentation for more information about writing scripts. The processor captures exceptions and logs, and sends the status (task track) and new tasks to the scheduler.

The Result Worker receives results from the processor. pyspider has a built-in result worker that saves results to the result database; you can override it to handle results according to your needs.

The WebUI is a web frontend for everything. It contains a script editor and debugger, a project manager, a task monitor, and a result viewer and exporter.
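The processor's job of running a user callback, capturing its exceptions, and collecting the new tasks and results it produces can be sketched as follows. All names here (`run_callback`, the `crawl` stand-in) are hypothetical; this is not pyspider's code:

```python
import traceback

def run_callback(callback, response):
    """Run a user-written parse callback, capture any exception,
    and collect the new tasks and the result it produces.

    Returns (status, new_tasks, result).
    """
    new_tasks = []

    def crawl(url, **kwargs):
        # Stand-in for pyspider's self.crawl(): just record the new task.
        new_tasks.append({"url": url, **kwargs})

    try:
        result = callback(crawl, response)
        return "success", new_tasks, result
    except Exception:
        # Exceptions in user scripts are captured, not propagated,
        # so one bad page cannot take down the processor.
        return "failed: " + traceback.format_exc(limit=1), new_tasks, None

# Example user callback: follow one link and return a result dict.
def parse(crawl, response):
    crawl(response["link"], callback="detail_page")
    return {"title": response["title"]}

status, tasks, result = run_callback(
    parse, {"title": "Hello", "link": "http://example.com/1"})
```

In the real system the captured status goes back to the scheduler as the task track, the new tasks enter the newtask_queue, and the result goes to the result worker.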
The WebUI may be the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider would run them, start or stop a project, and find out which project is going wrong and which request failed, then try it again with the debugger.

The data flow in pyspider follows the diagram above: new tasks flow from the scheduler through the fetcher to the processor, which sends any newly discovered tasks back to the scheduler and extracted results on to the result worker.
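The message-queue wiring between components can be sketched with stdlib queues. This is a toy, single-threaded illustration under assumed names, not pyspider's actual queue layer:

```python
import queue

# One queue per link in the pipeline, mirroring the architecture diagram.
newtask_queue = queue.Queue()      # processor -> scheduler
scheduler2fetcher = queue.Queue()  # scheduler -> fetcher
fetcher2processor = queue.Queue()  # fetcher -> processor

def scheduler_step():
    """Take a new task and dispatch it (dedup/priority/traffic control omitted)."""
    task = newtask_queue.get()
    scheduler2fetcher.put(task)

def fetcher_step():
    """Pretend to fetch the page for a task."""
    task = scheduler2fetcher.get()
    fetcher2processor.put({"task": task, "content": "<html>stub</html>"})

def processor_step(results):
    """Run the (stubbed) user script: record a result for the response."""
    response = fetcher2processor.get()
    results.append({"url": response["task"]["url"],
                    "content": response["content"]})

# A seed task enters the pipeline, then each component takes one step:
results = []
newtask_queue.put({"url": "http://example.com/"})
scheduler_step()
fetcher_step()
processor_step(results)
```

Because each component only touches its input and output queues, any of them can be replaced or scaled out to multiple processes or machines, as the overview describes.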