Key Points for Writing Web Crawlers in Python with the Mechanize Module
mechanize replaces part of urllib2's functionality and does a better job of mimicking browser behavior, offering more complete control over web access. Combined with the BeautifulSoup and re modules it can parse web pages effectively, and this is the approach I prefer.

```python
#!/usr/bin/env python
import sys
import mechanize

# Browser
br = mechanize.Browser()

# Options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Debugging?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
```
Open a page (here the URL comes from the command line) and read the response:

```python
r = br.open(sys.argv[1])
html = r.read()
print html
print br.response().read()
print br.title()
print r.info()
```
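As noted above, the fetched HTML can then be picked apart with BeautifulSoup or re. A minimal sketch using only the re module to extract the page title (the sample HTML string is made up for illustration, standing in for `html = r.read()`):

```python
import re

# Hypothetical page source, standing in for html = r.read()
html = "<html><head><title>Example Domain</title></head><body></body></html>"

# Non-greedy match between the <title> tags; DOTALL tolerates multi-line titles
m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = m.group(1) if m else None
print(title)  # Example Domain
```

For anything beyond one-off field extraction, BeautifulSoup is the more robust choice, since regexes are easily broken by attribute or whitespace variations in real pages.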
List the forms on the page and select the first one:

```python
for f in br.forms():
    print f
br.select_form(nr=0)
```

Query Google for "football" (q is Google's search field):

```python
br.form['q'] = 'football'
br.submit()
print br.response().read()
```
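For a search form like this, br.submit() ultimately issues a GET request with the field encoded into the query string. A stdlib-only sketch of that encoding (the URL is illustrative; mechanize builds it from the form's action attribute):

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# Encode the form field exactly as a GET submission would
query = urlencode({'q': 'football'})
url = "http://www.google.com/search?" + query
print(url)  # http://www.google.com/search?q=football
```

Seeing the encoded request is handy when debugging: comparing it against what the browser's developer tools show often reveals a missing hidden field.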
Query Baidu for "football" (wd is Baidu's search field):

```python
br.form['wd'] = 'football'
br.submit()
print br.response().read()
```

Go back to the previous page:

```python
# Back
br.back()
print br.geturl()
```

3. HTTP basic authentication

```python
br.add_password('http://xxx.com', 'username', 'password')
br.open('http://xxx.com')
```

4. Form-based authentication

```python
br.select_form(nr=0)
br['email'] = username
br['password'] = password
resp = br.submit()
```

5. Cookie support

```python
#!/usr/bin/env python
import mechanize
import cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
```

6. Proxy settings

```python
# Proxy
br.set_proxies({"http": "proxy.com:8888"})
br.add_proxy_password("username", "password")
# Proxy and user/password in one step
br.set_proxies({"http": "username:password@proxy.com:8888"})
```

7. Excessive memory usage

I wrote a crawler script with mechanize to fetch roughly 300,000 images from a site, and memory usage kept climbing as it ran.
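On the basic-auth point above: when the server answers with a 401 challenge, mechanize replies with an Authorization header built from the registered credentials. For HTTP Basic auth that header is just the base64 encoding of "user:password", which a short stdlib sketch makes concrete (the helper name is mine, not a mechanize API):

```python
import base64

def basic_auth_header(user, password):
    # HTTP Basic auth (RFC 7617): base64-encode "user:password"
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")

print(basic_auth_header("username", "password"))
```

This is also why basic auth over plain http:// is insecure: base64 is an encoding, not encryption, and anyone on the wire can decode the credentials.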
mechanize's Browser keeps every response it fetches in its browsing history by default (that is what makes back() work), so memory grows with every page. Passing in a do-nothing history object stops the accumulation:

```python
class NoHistory(object):
    def add(self, *a, **k):
        pass
    def clear(self):
        pass

b = mechanize.Browser(history=NoHistory())
```
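This works because mechanize only ever calls add() and clear() on its history object, so any object with those two methods will do; the stub satisfies the interface while retaining nothing. A quick standalone check of that behavior (no mechanize required):

```python
class NoHistory(object):
    """History stand-in that discards everything it is given."""
    def add(self, *a, **k):
        pass
    def clear(self):
        pass

h = NoHistory()
# Simulate a long crawl recording many request/response pairs; nothing is kept.
for i in range(100000):
    h.add("request-%d" % i, "response-%d" % i)
h.clear()
print("history object holds no state: %s" % (not vars(h)))
```

The trade-off is that br.back() no longer works, which is irrelevant for a one-directional crawl over hundreds of thousands of URLs.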