Daily Xunlei Membership Crawler

Published: 2020-12-17 17:14:34  Category: Python  Source: compiled from the web

Summary: PHP站长网 (52php.cn) shares this code, which was collected from the internet, for reference only.

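# coding=utf8
# Requires Python 3 and the third-party lxml package.
import re
import time
import urllib.request
from lxml import etree

url1  = 'http://521xunlei.com/portal.php'
# XPath to the href of the first list item in the portal block
path1 = '//*[@id="portal_block_62_content"]/div/ul/li[1]/a/@href'
# XPath to the text nodes inside the post body
path3 = '//*[@class="t_f"]/font/text()'

def geturlinfo(url, path, x):
    # Download the page and evaluate the XPath expression on it.
    response = urllib.request.urlopen(url)
    result   = response.read()
    restree  = etree.HTML(result)
    nodes    = restree.xpath(path)
    if x == '1':
        # Mode '1': return only the first match.
        return nodes[0]
    else:
        # Any other mode: number every line containing ':' and
        # write it to thunder.txt, overwriting the previous run.
        i = 0
        with open('thunder.txt', 'w', encoding='utf8') as f:
            for node in nodes:
                if re.search(':', node):
                    INFO = str(i) + ': ' + node.replace('\r\n', '')
                    print(INFO)
                    f.write(INFO + '\n')
                    i += 1

if __name__ == '__main__':
    while True:
        print('===================start===================\n')
        # Rebuild the site root from url1 and append the scraped relative href.
        url2 = 'http://' + url1.replace('http://', '').split('/')[0] + '/' + geturlinfo(url1, path1, '1')
        print('GET From: ' + url2)
        geturlinfo(url2, path3, '0')
        # Wait one day before the next run.
        time.sleep(24*3600)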
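
The script builds url2 by stripping the scheme from url1, keeping the host, and appending the scraped href. As a minimal sketch, the same step could be done in Python 3 with the standard library's `urllib.parse.urljoin`, which also copes with hrefs that are already absolute. The helper name `resolve_link` and the sample href are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

def resolve_link(base_url, href):
    """Resolve an href scraped from base_url into an absolute URL."""
    return urljoin(base_url, href)

if __name__ == '__main__':
    base = 'http://521xunlei.com/portal.php'
    # Hypothetical relative href, for illustration only.
    print(resolve_link(base, 'thread-1-1-1.html'))
```

Unlike the manual string handling, `urljoin` resolves a relative href against the directory of the page it came from, so the two approaches agree only when the page sits at the site root, as portal.php does here.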

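# starts-with(@id, "test") matches elements whose id starts with "test".

# To get an element's combined text, first select the target div, then evaluate XPath string(.) on it.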
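
The two notes above can be sketched with lxml as follows. This is only an illustration: the HTML fragment, the id prefix, and the helper name `texts_by_id_prefix` are all hypothetical and do not come from the target site. It selects every div whose id starts with a given prefix, then evaluates `string(.)` on each one to join the text of its descendants.

```python
from lxml import etree

# Hypothetical HTML fragment, for illustration only.
SAMPLE = '''
<html><body>
<div id="test_1"><font>user1:</font><b>pass1</b></div>
<div id="test_2"><font>user2:</font><b>pass2</b></div>
<div id="other">ignored</div>
</body></html>
'''

def texts_by_id_prefix(html, prefix):
    """Return the combined text of every div whose id starts with prefix."""
    tree = etree.HTML(html)
    # $prefix is an XPath variable; lxml binds it from the keyword argument.
    divs = tree.xpath('//div[starts-with(@id, $prefix)]', prefix=prefix)
    # string(.) concatenates all descendant text nodes of the context node.
    return [div.xpath('string(.)') for div in divs]

if __name__ == '__main__':
    for line in texts_by_id_prefix(SAMPLE, 'test'):
        print(line)
```

Compared with selecting `font/text()` directly, `string(.)` keeps text that is split across several child elements together in one string.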

(Editor: 李大同)