

Published: 2020-12-16 23:58:50 | Category: Python | Source: compiled from the web

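Original article: http://www.jianshu.com/p/c3fc3129407d

1. The WebMagic crawler framework

WebMagic is a simple, flexible crawler framework. With it you can quickly build a crawler that is efficient and easy to maintain.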

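1.1 Official site

The official documentation is fairly clear, so I recommend reading it directly; you can also read on below. The addresses are as follows:

Official site:

Chinese documentation:

English: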

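2. Integrating WebMagic with Spring Boot

Combining Spring Boot with WebMagic involves three modules: the crawling module (Processor); the persistence module (Pipeline), which writes the crawled data to the database; and the scheduled-task module (Scheduled), which is responsible for crawling the site on a schedule.

2.1 Adding the Maven dependencies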

 
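    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.5.3</version>
    </dependency>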

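2.2 Crawling module: Processor

The Processor for the Jianshu home page parses the page, extracts each article's link and title, and puts them into WebMagic's Page; the persistence module later takes them out and adds them to the database. The code is as follows: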

    import com.shang.spray.entity.News;
    import com.shang.spray.entity.Sources;
    import com.shang.spray.pipeline.NewsPipeline;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;

    import java.util.List;

    /**
     * info: crawler for the Jianshu home page
     * Created by shang on 16/9/9.
     */
    public class JianShuProcessor implements PageProcessor {

        private Site site = Site.me()
                .setDomain("jianshu.com")
                .setSleepTime(100)
                .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/52.0.2743.116 Safari/537.36");

        public static final String list = "http://www.jianshu.com";

        @Override
        public void process(Page page) {
            if (page.getUrl().regex(list).match()) {
                // One node per article entry on the home page
                List<Selectable> items = page.getHtml().xpath("//ul[@class='article-list thumbnails']/li").nodes();
                for (Selectable s : items) {
                    String title = s.xpath("//div/h4/a/text()").toString();
                    String link = s.xpath("//div/h4").links().toString();
                    News news = new News();
                    news.setTitle(title);
                    news.setInfo(title);
                    news.setLink(link);
                    news.setSources(new Sources(5));
                    // The "news" key prefix is what NewsPipeline looks for
                    page.putField("news" + title, news);
                }
            }
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            // Note: a NewsPipeline created with "new" is not managed by Spring,
            // so its @Autowired repository is null here; inside the application
            // the Spring-managed bean is used instead (see section 2.4).
            spider.addPipeline(new NewsPipeline());
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
        }
    }
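
To experiment with the two XPath steps used in the processor (select the list items, then read the title and link inside each item) without running a crawl, the following is a minimal sketch that uses only the JDK's built-in XPath support. It is not WebMagic's selector API: the class name `XPathExtractionSketch` and the sample markup are hypothetical, and the JDK parser only accepts well-formed XML, which real pages often are not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XPathExtractionSketch {

    // Hypothetical, well-formed stand-in for the home page markup.
    static final String SAMPLE_HTML =
            "<html><body><ul class='article-list thumbnails'>"
          + "<li><div><h4><a href='/p/aaa'>First post</a></h4></div></li>"
          + "<li><div><h4><a href='/p/bbb'>Second post</a></h4></div></li>"
          + "</ul></body></html>";

    /** Returns one {title, link} pair per list item. */
    static List<String[]> extract(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: select every <li> of the article list.
        NodeList items = (NodeList) xpath.evaluate(
                "//ul[@class='article-list thumbnails']/li", doc, XPathConstants.NODESET);

        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Node li = items.item(i);
            // Step 2: relative paths are evaluated against the current <li>.
            String title = xpath.evaluate("div/h4/a/text()", li);
            String link = xpath.evaluate("div/h4/a/@href", li);
            result.add(new String[] {title, link});
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String[] pair : extract(SAMPLE_HTML)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

In the real processor, WebMagic fetches and parses the page itself, so only the XPath expressions carry over from this sketch.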

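2.3 Persistence module: Pipeline

The persistence module works together with a Spring Data Repository. The class implements WebMagic's Pipeline interface; its process method reads the data produced by the crawling module and then calls the repository's save method. The code is as follows: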


    import com.shang.spray.entity.News;
    import com.shang.spray.repository.NewsRepository;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.jpa.domain.Specification;
    import org.springframework.stereotype.Repository;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Predicate;
    import javax.persistence.criteria.Root;
    import java.util.Date;
    import java.util.Map;

    /**
     * info: news
     * Created by shang on 16/8/22.
     */
    @Repository
    public class NewsPipeline implements Pipeline {

        @Autowired
        protected NewsRepository newsRepository;

        @Override
        public void process(ResultItems resultItems, Task task) {
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                // Only handle the fields the processor stored under a "news..." key
                if (entry.getKey().contains("news")) {
                    final News news = (News) entry.getValue();
                    // Query by link to find out whether this article was saved before
                    Specification<News> specification = new Specification<News>() {
                        @Override
                        public Predicate toPredicate(Root<News> root, CriteriaQuery<?> criteriaQuery, CriteriaBuilder criteriaBuilder) {
                            return criteriaBuilder.and(criteriaBuilder.equal(root.get("link"), news.getLink()));
                        }
                    };
                    if (newsRepository.findOne(specification) == null) { // check whether the link already exists
                        news.setAuthor("水花");
                        news.setTypeId(1);
                        news.setSort(1);
                        news.setStatus(1);
                        news.setExplicitLink(true);
                        news.setCreateDate(new Date());
                        news.setModifyDate(new Date());
                        newsRepository.save(news);
                    }
                }
            }
        }
    }
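
The check-then-save step in the pipeline (look the link up, store the item only when it is not there yet) can be sketched without a database. In the sketch below an in-memory map stands in for the news table; the class `DedupStoreSketch` and its methods are hypothetical names used for illustration only and are not part of the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DedupStoreSketch {

    // In-memory stand-in for the database table, keyed by article link.
    private final Map<String, String> titlesByLink = new LinkedHashMap<>();

    /** Stores the item only if its link is new; returns true when it was stored. */
    boolean saveIfAbsent(String link, String title) {
        // putIfAbsent returns null when no value was mapped to the key before.
        return titlesByLink.putIfAbsent(link, title) == null;
    }

    int size() {
        return titlesByLink.size();
    }

    public static void main(String[] args) {
        DedupStoreSketch store = new DedupStoreSketch();
        System.out.println(store.saveIfAbsent("/p/aaa", "First post"));
        System.out.println(store.saveIfAbsent("/p/aaa", "First post (crawled again)"));
        System.out.println(store.size());
    }
}
```

With a real database and several crawler threads, a separate lookup followed by a save can still let duplicates through; a unique constraint on the link column is one common way to close that gap.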

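2.4 Scheduled-task module: Scheduled

Use Spring Boot's built-in scheduling annotation, @Scheduled(cron = "0 0 0/2 * * ?"), to run the crawl every two hours, starting at 00:00 each day. The scheduled task invokes the WebMagic crawling module (Processor). The code is as follows: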


    import com.shang.spray.common.processor.JianShuProcessor;
    import com.shang.spray.pipeline.NewsPipeline;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Spider;

    /**
     * info: scheduled news tasks
     * Created by shang on 16/8/22.
     */
    @Component
    public class NewsScheduled {

        @Autowired
        private NewsPipeline newsPipeline;

        /**
         * Jianshu
         */
        @Scheduled(cron = "0 0 0/2 * * ?") // every 2 hours, starting at 00:00
        public void jianShuScheduled() {
            System.out.println("---- Starting the Jianshu scheduled task");
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            // Use the Spring-managed pipeline so that its repository is injected
            spider.addPipeline(newsPipeline);
            spider.thread(5);
            spider.setExitWhenComplete(true);
            // start() runs the spider asynchronously. The original listing called
            // spider.stop() right after it, which can halt the crawl before it
            // finishes; the spider exits by itself once the queue is empty.
            spider.start();
        }

    }
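
To make the schedule concrete: the cron expression "0 0 0/2 * * ?" fires at second 0, minute 0 of every even hour (00:00, 02:00, ..., 22:00). The sketch below computes the next few firing times with plain `java.time`. It does not use Spring's cron parser; the class `EveryTwoHoursSketch` and the method `nextRun` are hypothetical, and time zones and daylight-saving changes are ignored.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class EveryTwoHoursSketch {

    /** Next time strictly after 'now' that falls on an even hour at minute 0, second 0. */
    static LocalDateTime nextRun(LocalDateTime now) {
        // Move to the start of the following hour...
        LocalDateTime next = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        // ...and skip one more hour if that hour is odd.
        if (next.getHour() % 2 != 0) {
            next = next.plusHours(1);
        }
        return next;
    }

    public static void main(String[] args) {
        // Arbitrary sample start time, chosen only for illustration.
        LocalDateTime t = LocalDateTime.of(2016, 9, 9, 13, 25, 0);
        for (int i = 0; i < 3; i++) {
            t = nextRun(t);
            System.out.println(t);
        }
    }
}
```

Starting from 13:25, the three computed times are 14:00, 16:00 and 18:00 on the same day.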

2.5 Enabling scheduled tasks in Spring Boot

Enable scheduled tasks by adding the @EnableScheduling annotation to the Spring Boot Application class. The code is as follows:

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.builder.SpringApplicationBuilder;
    import org.springframework.boot.context.web.SpringBootServletInitializer;
    import org.springframework.scheduling.annotation.EnableScheduling;

    /**
     * info:
     * Created by shang on 16/7/8.
     */
    // @SpringBootApplication already combines @Configuration,
    // @EnableAutoConfiguration and @ComponentScan, so the original listing's
    // separate copies of those three annotations are omitted here.
    @SpringBootApplication
    @EnableScheduling // turns on processing of @Scheduled methods
    public class SprayApplication extends SpringBootServletInitializer {

        @Override
        protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
            return application.sources(SprayApplication.class);
        }

        public static void main(String[] args) throws Exception {
            SpringApplication.run(SprayApplication.class, args);
        }
    }

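3. Closing remarks

WebMagic is the crawler framework I used to crawl website data in the 水花一现 project. I chose it after comparing it with several other crawler frameworks: it is fairly simple to learn yet powerful. I only used its basic features here; many of its more advanced features went unused. If you are interested, take a look at the official documentation.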

(Editor: 李大同)
