Usage Overview

Published: 2020-12-16 23:58:18  Category: Python  Source: compiled from the web


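A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

At its core, a crawler is a program that automatically retrieves the information you care about from the internet and keeps the parts that are valuable to you. Crawling is a common way to collect the data that big-data and cloud-computing applications work on.

Implementing a crawler can be thought of as simulating the data exchange between a browser and a server by constructing the HTTP requests yourself.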
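As a rough illustration of constructing a browser-like HTTP request, the sketch below builds a request object without sending it, so it runs offline. It uses the standard library's urllib.request rather than requests, and the URL, the header values and the helper name build_browser_like_request are assumptions made up for this example.

```python
from urllib.request import Request


def build_browser_like_request(url: str) -> Request:
    """Build (but do not send) a GET request that carries browser-style headers."""
    headers = {
        # Hypothetical header values, chosen only for illustration.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
        "Accept": "text/html",
    }
    return Request(url, headers=headers, method="GET")


req = build_browser_like_request("http://example.com/index.html")

print(req.get_method())   # HTTP method that would be used
print(req.full_url)       # target URL
# urllib stores header names capitalized as "User-agent", so look them up that way.
print(req.get_header("User-agent"))
```

Passing the request object to urllib.request.urlopen would actually send it; the server then sees the headers chosen here instead of the library defaults.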

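Usage Overview

Page-fetching library: requests

Content-parsing libraries: re (regular expressions), BeautifulSoup

Checking a site's crawler policy: request /robots.txt at the root of the site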
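One way to check a crawler policy from code is the standard library's urllib.robotparser. The sketch below parses an in-memory robots.txt so that it runs without network access; the file content, the crawler name MyCrawler and the helper build_parser are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
print(rules.can_fetch("MyCrawler", "http://example.com/index.html"))
print(rules.can_fetch("MyCrawler", "http://example.com/private/data.html"))
```

Against a live site, calling set_url with the address of the site's robots.txt and then read() downloads and parses the file in place of parse().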

Basic usage of the requests library

Installation:

sudo pip3 install requests

Usage:

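import requests

# urlx and the other bare names (field_name, file_path, ...) stand for values you supply.

# GET request; res.url is the final URL
res = requests.get(urlx)
print(res.url)

# POST request
res = requests.post(urlx)

# Upload a file
filex = {field_name: (file_name, file_obj)}
res = requests.post(urlx, files=filex)

# Read cookies from the response
print(res.cookies)
print(res.cookies[cookie_name])

# Send cookies with a request
coo = {name1: value1, name2: value2}
res = requests.post(urlx, cookies=coo)

# Keep cookies across requests with a session
ss = requests.Session()
res = ss.post(urlx)
res = ss.post(urlx)

# Timeout in seconds
res = requests.post(urlx, timeout=0.2)

# Custom request headers
headx = {name1: value1, name2: value2, name3: value3}
res = requests.get(urlx, headers=headx)
print(res.request.headers)

# Status code; raise_for_status() raises on 4xx/5xx
print(res.status_code)
res.raise_for_status()

# Response encoding (can be overridden)
print(res.encoding)
res.encoding = encoding_name

# Response headers and body
print(res.headers)
print(res.headers[header_name])
print(res.text)

# JSON body
jsontt1 = res.json()
print(jsontt1.keys())
print(jsontt1[key])

# Save binary content to a file
res = requests.get(urlx, timeout=5)
f = open(file_path, "wb")
f.write(res.content)
f.close()

# Read the raw stream
res = requests.get(urlx, stream=True)
res.raw.read(10)

# Write the stream to a file one byte at a time
res = requests.get(urlx, stream=True)
rxx = res.raw.read(1)
f = open(file_path, "wb")
while rxx:
    f.write(rxx)
    rxx = res.raw.read(1)
f.close()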

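Basic usage of the re library (regular expressions)

Installation: none needed; re is part of the Python standard library.

Basics:

Metacharacters (escape them with \ to match them literally): . \ ? ^ $ * + } { [ ] | ( )

^ negates a character class when it is the first character inside square brackets: the class then matches any character except those listed.

r't(el)*' means the group el may repeat any number of times, including zero.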
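The rules above can be tried out with a few short patterns. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# Negated class: [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2c3")

# Group repetition: after "t", the group "el" may appear zero or more times.
candidates = ["t", "tel", "telel", "te"]
matched = [s for s in candidates if re.fullmatch(r"t(el)*", s)]

# Escaping: \. matches a literal dot instead of "any character".
dots = re.findall(r"\.", "a.b.c")

print(non_digits)
print(matched)
print(dots)
```

re.fullmatch is used here so that a candidate only counts when the whole string fits the pattern; "te" is rejected because the trailing "e" is not a complete "el" group.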

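Usage:

import re

# text is the string to search; pattern, other_pattern and repl are values you supply.

# findall: return every match as a list
zstr = re.findall(pattern, text)
print(zstr)

# findall with another pattern
zstr = re.findall(other_pattern, text)
print(zstr)

# compile: build a reusable pattern (re.I ignores case, re.X allows verbose patterns)
re_job = re.compile(pattern, re.I | re.X)
zstr = re_job.findall(text)
print(zstr)

# match: match only at the start of the string; returns None if nothing matches
zstr = re.match(pattern, text)
print(zstr)

# search: find the first match anywhere in the string; returns None if nothing matches
zstr = re.search(pattern, text)
print(zstr)

# sub: replace every match with repl
zstr = re.sub(pattern, repl, text)
print(zstr)

# split: split the string wherever the pattern matches
zstr = re.split(pattern, text)
print(zstr)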
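To make the functions above concrete, here is a small sketch that runs each of them on one sample string. The string and the patterns are hypothetical and chosen only to show the differences between the calls.

```python
import re

text = "Tel: 010-1234, tel: 020-5678"

# findall: every non-overlapping match, as a list of strings.
numbers = re.findall(r"\d{3}-\d{4}", text)

# compile with re.I: a reusable, case-insensitive pattern.
tel_job = re.compile(r"tel", re.I)
labels = tel_job.findall(text)

# match: only succeeds at the very start of the string.
at_start = re.match(r"\d+", text)

# search: the first match anywhere in the string.
first = re.search(r"\d+", text)

# sub: replace every digit with "#".
masked = re.sub(r"\d", "#", text)

# split: cut the string at each comma plus optional whitespace.
parts = re.split(r",\s*", text)

print(numbers, labels, at_start, first.group(), first.span(), masked, parts)
```

Because the sample string starts with letters, re.match with a digit pattern returns None, while re.search finds the first digit run further in.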

Basic usage of the BeautifulSoup library

sudo pip3 install beautifulsoup4

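import re
import requests
from bs4 import BeautifulSoup

# urlx, parser (e.g. "html.parser"), tag and attribute names are values you supply.

# Fetch a page and parse it
res = requests.get(urlx)
res.encoding = encoding_name
be = BeautifulSoup(res.text, parser)
print(be.original_encoding)
print(be.prettify())              # pretty-printed document

# Access tags by name (first match) and walk the tree
print(be.input)
print(be.form.input)
print(be.form.encode())
print(be.input.parent.parent)
print(be.input.previous_sibling)
print(be.input.next_sibling)

# Read attributes
print(be.img)
picture = be.img
print(picture.get(attr_name))
print(be.img[attr_name])

# Title tag and its text (the sample page's title is 东小东页)
print(be.title)
print(be.title.text)
print(be.title.string)

# find_all: return all matching tags
print(be.find_all(class_=class_name, limit=2))
print(be.find_all(tag_name))
be.find_all(id=id_value)
print(be.find_all(type=True))     # tags that have a type attribute
print(be.find_all(class_=class_name))
print(be.find_all(src=re.compile(pattern)))
print(be.find_all(tag_name)[0][attr_name])

# Match tag names with a regular expression or a list of names
for inx in be.find_all(re.compile(pattern)):
    print(inx)
for inx in be.find_all([tag_name1, tag_name2]):
    print(inx.get(attr_name))

# find: return only the first match
print(be.find(type=type_value))
print(be.find(tag_name))
print(be.find(tag_name, type=type_value))
print(be.find(text=some_text).parent)   # sample text: 东小东; newer bs4 versions prefer string=
print(be.find_all(attrs={attr_name: attr_value}))

# Parse a local file
be = BeautifulSoup(open(file_path, mode), parser)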
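The calls above are easier to follow on a page you can see. The sketch below parses a small inline HTML string, so no network access is needed; the markup, class names and attribute values are hypothetical, and the parser is assumed to be Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this example.
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
<form id="login"><input type="text" name="user"><input type="password" name="pwd"></form>
<img class="logo" src="/static/logo.png">
<a class="nav" href="/a">A</a><a class="nav" href="/b">B</a><a class="nav" href="/c">C</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_text = soup.title.string                      # text inside <title>
img_src = soup.img["src"]                           # attribute of the first <img>
first_two = soup.find_all(class_="nav", limit=2)    # at most two matches
hrefs = [a.get("href") for a in first_two]
typed = soup.find_all(type=True)                    # tags that carry a type attribute
input_names = [tag.get("name") for tag in soup.find_all("input")]
pwd = soup.find("input", type="password")           # first match only
by_attrs = soup.find_all(attrs={"name": "user"})    # filter on an attribute called "name"

print(title_text, img_src, hrefs, input_names, pwd["name"], len(typed), len(by_attrs))
```

The attrs dictionary is used for the name attribute because find_all already uses the keyword name for the tag name.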

(Editor: 李大同)
