Parsing Facebook mobile in a Python scraper script fails with lxml error "IOE
Published: 2020-12-20 12:27:58 | Category: Python | Source: compiled from the web
I am using a script modified from the post "Logging into facebook with python":

    #!/usr/bin/python2 -u
    # -*- coding: utf8 -*-

    facebook_email = "YOUR_MAIL@DOMAIN.TLD"
    facebook_passwd = "YOUR_PASSWORD"

    import cookielib,urllib2,urllib,time,sys
    from lxml import etree

    jar = cookielib.CookieJar()
    cookie = urllib2.HTTPCookieProcessor(jar)
    opener = urllib2.build_opener(cookie)

    headers = {
        "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML,like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
        "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5",
        "Accept-Language" : "en-us,en;q=0.5",
        "Accept-Charset" : "utf-8",
        "Content-type": "application/x-www-form-urlencoded",
        "Host": "m.facebook.com"
    }

    try:
        params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'})
        req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F',params,headers)
        res = opener.open(req)
        html = res.read()
    except urllib2.HTTPError,e:
        print e.msg
    except urllib2.URLError,e:
        print e.reason[1]

    def fetch(url):
        req = urllib2.Request(url,None,headers)
        res = opener.open(req)
        return res.read()

    body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"),errors='ignore')
    tree = etree.parse(body)
    r = tree.xpath('/see_prev')
    print r.text

When I run the code, I get this error:

    $ ./facebook_fetch_coms.py
    Traceback (most recent call last):
      File "./facebook_fetch_coms_classic_test.py", line 42, in <module>
        tree = etree.parse(body)
      File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230)
      File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)
      File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)
      File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)
      File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)
      File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)
      File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)
      File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691)
    IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life."

The goal is to first use lxml to get the link with id = see_prev, then use a while loop to open all the comments, and finally collect all the messages in a file. Any help would be greatly appreciated!

Solution
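Here is your problem:

    tree = etree.parse(body)

The documentation says "source is a filename or file object containing XML data." You supplied a string, so lxml treats the text of the HTTP response body as the name of the file you want to open. No such file exists, so you get an IOError.

The error message even says "Error reading file" and then shows your XML string as the name of the file it tried to read, which is a strong hint about what is going on.

You probably want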
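
To illustrate the distinction between a file name and in-memory data, here is a minimal Python 3 sketch that parses the body from memory instead of passing it to parse() as a path. It uses etree.fromstring(), and alternatively wraps the bytes in io.BytesIO so that parse() receives a file object. SAMPLE_BODY, parse_body, parse_body_as_file, and find_see_prev_href are hypothetical names made up for this example, and the sample markup (including the href value) is an assumption, not a real Facebook response. If lxml is not installed, the sketch falls back to the standard library's xml.etree.ElementTree, which has the same parse()/fromstring() split.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the standard library module mirrors parse()/fromstring().
    import xml.etree.ElementTree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the HTTP response body: bytes, not a file name.
SAMPLE_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>Facebook</title></head>'
    b'<body><a id="see_prev" href="/more?page=2">See previous</a></body>'
    b'</html>'
)


def parse_body(body):
    """Parse XML already held in memory; fromstring() takes the data itself."""
    return etree.fromstring(body)


def parse_body_as_file(body):
    """Alternative: give parse() a file-like object wrapping the bytes."""
    return etree.parse(io.BytesIO(body)).getroot()


def find_see_prev_href(root):
    """Return the href of the <a> element with id="see_prev", or None."""
    # The document declares the XHTML namespace, so the tag must be qualified.
    for element in root.iter("{%s}a" % XHTML_NS):
        if element.get("id") == "see_prev":
            return element.get("href")
    return None


if __name__ == "__main__":
    root = parse_body(SAMPLE_BODY)
    print(find_see_prev_href(root))
```

Passing bytes rather than a decoded string matters here because the body carries an XML encoding declaration. Real pages are often not well-formed XML; in that case lxml.html.fromstring() is a more lenient option worth trying.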