Go in Practice: jQuery for golang (PuerkitoBio/goquery, getting links from HTML)

Published: 2020-12-16 09:43:18 | Category: Big Data | Source: collected from the web

Life goes on, so keep going: go go go!!!
jQuery is pretty much a household name.

jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers. With a combination of versatility and extensibility, jQuery has changed the way that millions of people write JavaScript.

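jQuery is a JavaScript library.
It offers the following features:
HTML element selection
HTML element manipulation
CSS manipulation
HTML event functions
JavaScript effects and animation
HTML DOM traversal and modification
AJAX
Utilities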

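In the Go world, the library github.com/PuerkitoBio/goquery provides jQuery-like functionality, which makes it convenient to work with HTML documents in Go.

Remember: if you do any web scraping in Go, you will very likely end up using goquery!

Reference: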
http://blog.studygolang.com/2015/04/go-jquery-goquery/

PuerkitoBio/goquery

GitHub repository:
https://github.com/PuerkitoBio/goquery

Stars: 4833 (at the time of writing)

Description:
A little like that j-thing, only in Go.

Install:
go get github.com/PuerkitoBio/goquery

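Creating a Document object
goquery exposes two main structs: Document and Selection.
Document represents an HTML document; Selection is what you operate on, jQuery-style, and it supports method chaining. goquery needs an HTML document to be specified before any further operation can be performed.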

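Finding specific nodes
Selection has a set of jQuery-like methods, and the Document struct embeds *Selection, so those methods can be called on a Document directly. The main one is Selection.Find(selector string): pass it a selector and it returns a new *Selection holding the matched nodes, which is what makes chaining possible.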

Working with attributes
You often need the content of a tag and the values of some of its attributes; goquery makes this easy.

Official example

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	// NewDocument fetches the URL and parses the response body.
	// Later goquery releases deprecate it in favour of net/http plus
	// goquery.NewDocumentFromReader.
	doc, err := goquery.NewDocument("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}

Output:

Review 0: Cavalera Conspiracy - Psychosis
Review 1: Cannibal Corpse - Red Before Black
Review 2: All Pigs Must Die - Hostage Animal
Review 3: Electric Wizard - Wizard Bloody Wizard
Review 4: Trivium - The Sin and the Sentence

Getting links from HTML

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func linkScrape() {
	doc, err := goquery.NewDocument("http://jonathanmh.com")
	if err != nil {
		log.Fatal(err)
	}

	// Visit every <a> inside <body> and print its text and href.
	doc.Find("body a").Each(func(index int, item *goquery.Selection) {
		// The second return value of Attr (whether href exists) is ignored here.
		link, _ := item.Attr("href")
		linkText := item.Text()
		fmt.Printf("Link #%d: '%s' - '%s'\n", index, linkText, link)
	})
}

func main() {
	linkScrape()
}

Output:

Link #0: 'Skip to content' - '#content'
Link #1: 'JonathanMH' - 'https://jonathanmh.com/'
Link #2: 'Blog' - 'https://jonathanmh.com/category/blog/'
Link #3: 'Hire Me' - 'https://jonathanmh.com/hire-me/'
Link #4: 'About' - 'https://jonathanmh.com/about/'
Link #5: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #6: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #7: 'github' - 'https://github.com/JonathanMH'
Link #8: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #9: 'instagram' - 'http://instagram.com/jonathanmh'
Link #10: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #11: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #12: 'hire me' - '/hire-me'
Link #13: '' - 'https://twitter.com/JonathanMH_com'
Link #14: '' - 'https://www.facebook.com/JonathanMH-159526834122370/'
Link #15: '' - 'https://www.instagram.com/jonathanmh/'
Link #16: '' - 'https://github.com/jonathanmh/'
Link #17: 'Work every day like you just got fired' - 'https://jonathanmh.com/work-every-day-like-just-got-fired/'
Link #18: 'Vue.js API Client / Single Page App (SPA) Tutorial' - 'https://jonathanmh.com/vue-js-api-client-single-page-app-spa-tutorial/'
Link #19: 'Building a Simple Searchable API with Express (Backend)' - 'https://jonathanmh.com/building-a-simple-searchable-api-with-express-backend/'
Link #20: 'Music Monday: Doom Soundtrack' - 'https://jonathanmh.com/music-monday-doom-soundtrack/'
Link #21: 'Brick by Brick' - 'https://jonathanmh.com/brick-by-brick/'
Link #22: 'Taking Screenshots with Headless,The Chrome Debuggping Protocol (CDP) and Golang' - 'https://jonathanmh.com/taking-screenshots-headless-chrome-debuggping-protocol-cdp-golang/'
Link #23: 'Firefox has re-joined the Browser Wars' - 'https://jonathanmh.com/firefox-re-joined-browser-wars/'
Link #24: 'A Mastodon Review,is it the next Twitter / Facebook by the People?' - 'https://jonathanmh.com/mastodon-review-next-twitter-facebook-people/'
Link #25: 'Testing Coin Hive Crowd Source Monero Mining' - 'https://jonathanmh.com/testing-coin-hive-crowd-source-monero-mining/'
Link #26: 'Glass Half' - 'https://jonathanmh.com/glass-half/'
Link #27: 'read older posts' - '/blog/page/2/'
Link #28: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #29: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #30: 'github' - 'https://github.com/JonathanMH'
Link #31: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #32: 'instagram' - 'http://instagram.com/jonathanmh'
Link #33: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #34: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #35: '.htaccess' - 'https://jonathanmh.com/tag/htaccess/'
Link #36: 'Adobe' - 'https://jonathanmh.com/tag/adobe/'
Link #37: 'Android' - 'https://jonathanmh.com/tag/android/'
Link #38: 'Arch Linux' - 'https://jonathanmh.com/tag/arch-linux/'
Link #39: 'atom' - 'https://jonathanmh.com/tag/atom/'
Link #40: 'bash' - 'https://jonathanmh.com/tag/bash/'
Link #41: 'blogging' - 'https://jonathanmh.com/tag/blogging/'
Link #42: 'Brackets' - 'https://jonathanmh.com/tag/brackets/'
Link #43: 'cigtrack' - 'https://jonathanmh.com/tag/cigtrack/'
Link #44: 'CodeIgniter' - 'https://jonathanmh.com/tag/codeigniter/'
Link #45: 'CSS' - 'https://jonathanmh.com/tag/css/'
Link #46: 'Digital Ocean' - 'https://jonathanmh.com/tag/digital-ocean/'
Link #47: 'express.js' - 'https://jonathanmh.com/tag/express-js/'
Link #48: 'facebook' - 'https://jonathanmh.com/tag/facebook/'
Link #49: 'ghost' - 'https://jonathanmh.com/tag/ghost/'
Link #50: 'git' - 'https://jonathanmh.com/tag/git/'
Link #51: 'github' - 'https://jonathanmh.com/tag/github/'
Link #52: 'gitlab' - 'https://jonathanmh.com/tag/gitlab/'
Link #53: 'go' - 'https://jonathanmh.com/tag/go/'
Link #54: 'golang' - 'https://jonathanmh.com/tag/golang/'
Link #55: 'Google' - 'https://jonathanmh.com/tag/google/'
Link #56: 'Gulp' - 'https://jonathanmh.com/tag/gulp/'
Link #57: 'gvim' - 'https://jonathanmh.com/tag/gvim/'
Link #58: 'JavaScript' - 'https://jonathanmh.com/tag/javascript/'
Link #59: 'kickstarter' - 'https://jonathanmh.com/tag/kickstarter/'
Link #60: 'Linux' - 'https://jonathanmh.com/tag/linux/'
Link #61: 'markdown' - 'https://jonathanmh.com/tag/markdown/'
Link #62: 'mindset' - 'https://jonathanmh.com/tag/mindset/'
Link #63: 'MVC' - 'https://jonathanmh.com/tag/mvc/'
Link #64: 'Nginx' - 'https://jonathanmh.com/tag/nginx/'
Link #65: 'node.js' - 'https://jonathanmh.com/tag/node-js/'
Link #66: 'npm' - 'https://jonathanmh.com/tag/npm/'
Link #67: 'PHP' - 'https://jonathanmh.com/tag/php/'
Link #68: 'plugin' - 'https://jonathanmh.com/tag/plugin/'
Link #69: 'Raspberry PI' - 'https://jonathanmh.com/tag/raspberry-pi/'
Link #70: 'SCSS' - 'https://jonathanmh.com/tag/scss/'
Link #71: 'social media' - 'https://jonathanmh.com/tag/social-media/'
Link #72: 'ssh' - 'https://jonathanmh.com/tag/ssh/'
Link #73: 'Terminal' - 'https://jonathanmh.com/tag/terminal/'
Link #74: 'toolbox' - 'https://jonathanmh.com/tag/toolbox/'
Link #75: 'UberWriter' - 'https://jonathanmh.com/tag/uberwriter/'
Link #76: 'Ubuntu' - 'https://jonathanmh.com/tag/ubuntu/'
Link #77: 'vim' - 'https://jonathanmh.com/tag/vim/'
Link #78: 'web crawling' - 'https://jonathanmh.com/tag/web-crawling/'
Link #79: 'WordPress' - 'https://jonathanmh.com/tag/wordpress/'
Link #80: 'Blog' - 'https://jonathanmh.com/category/blog/'
Link #81: 'Hire Me' - 'https://jonathanmh.com/hire-me/'
Link #82: 'About' - 'https://jonathanmh.com/about/'
Link #83: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #84: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #85: 'github' - 'https://github.com/JonathanMH'
Link #86: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #87: 'instagram' - 'http://instagram.com/jonathanmh'
Link #88: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #89: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #90: 'JonathanMH' - 'https://jonathanmh.com/'
Link #91: 'Proudly powered by WordPress' - 'https://wordpress.org/'

package main

import (
	"os"
	"strings"
	"text/template"

	"github.com/PuerkitoBio/goquery"
)

// rstLink renders one link as a reStructuredText hyperlink.
const rstLink = "`{{.Text}} <{{.Href}}>`_\n"

type htmlLink struct {
	Text string
	Href string
}

func main() {
	url := "https://www.baidu.com"

	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}

	tmpl := template.Must(template.New("test").Parse(rstLink))
	doc.Find("a").Each(func(_ int, link *goquery.Selection) {
		text := strings.TrimSpace(link.Text())
		// Only print <a> elements that actually carry an href attribute.
		href, ok := link.Attr("href")
		if ok {
			if err := tmpl.Execute(os.Stdout, &htmlLink{text, href}); err != nil {
				panic(err)
			}
		}
	})
}

Output:

` </>`_
`手写 <javascript:;>`_
`拼音 <javascript:;>`_
`关闭 <javascript:;>`_
`百度首页 </>`_
`设置 <javascript:;>`_
`登录 <https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F>`_
`新闻 <http://news.baidu.com>`_
`hao123 <http://www.hao123.com>`_
`地图 <http://map.baidu.com>`_
`视频 <http://v.baidu.com>`_
`贴吧 <http://tieba.baidu.com>`_
`学术 <http://xueshu.baidu.com>`_
`登录 <https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F>`_
`设置 <http://www.baidu.com/gaoji/preferences.html>`_
`更多产品 <http://www.baidu.com/more/>`_
`新闻 <http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=>`_
`贴吧 <http://tieba.baidu.com/f?kw=&fr=wwwt>`_
`知道 <http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt>`_
`音乐 <http://music.baidu.com/search?fr=ps&ie=utf-8&key=>`_
`图片 <http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=>`_
`视频 <http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=>`_
`地图 <http://map.baidu.com/m?word=&fr=ps01000>`_
`文库 <http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8>`_
`更多? <//www.baidu.com/more/>`_
`把百度设为主页 <//www.baidu.com/cache/sethelp/help.html>`_
`关于百度 <http://home.baidu.com>`_
`About  Baidu <http://ir.baidu.com>`_
`百度推广 <http://e.baidu.com/?refer=888>`_
`使用百度前必读 <http://www.baidu.com/duty/>`_
`意见反馈 <http://jianyi.baidu.com/>`_
`京公网安备11000002000001号 <http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001>`_
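
The outputs above contain relative hrefs such as '/hire-me' and '//www.baidu.com/more/'. If the links are going to be fetched later, for example by a crawler, they usually need to be made absolute first. The following is a minimal sketch that uses only the standard library's net/url package; the resolveHref helper is a hypothetical name, not part of goquery.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolveHref (a hypothetical helper) turns an href taken from a page
// into an absolute URL, using the address of that page as the base.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference keeps absolute URLs as they are and fills in the
	// missing scheme and host for relative ones.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	examples := []struct{ page, href string }{
		{"https://jonathanmh.com", "/hire-me"},
		{"https://www.baidu.com", "//www.baidu.com/more/"},
		{"https://jonathanmh.com", "https://wordpress.org/"},
	}
	for _, e := range examples {
		abs, err := resolveHref(e.page, e.href)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> %s\n", e.href, abs)
	}
}
```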

(Editor: 李大同)
