python - ImportError when using gevent with the requests async module

Published: 2020-12-20 13:27:41  Category: Python  Source: collected from the web

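I am writing a simple script that:

- loads a large number of URLs
- fetches the content of each URL with the requests library's async module, so that the HTTP requests run concurrently
- parses each page with lxml to check whether a particular link appears in it
- if the link is present, saves some information about the page in a ZODB database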
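The steps above can be sketched with the standard library alone. This is a minimal illustration, not the script from the question: it is written for Python 3, it swaps in `concurrent.futures` for the requests async module, `html.parser` for lxml, and a plain dict for ZODB, and the names `LinkFinder`, `check_all`, and the canned `PAGES` data are hypothetical stand-ins so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkFinder(HTMLParser):
    """Collect (href, anchor text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html_doc, domain="example.org"):
    """Return (target, anchor) for the first absolute link to domain, else None."""
    finder = LinkFinder()
    finder.feed(html_doc)
    for href, text in finder.links:
        if href.startswith("http") and domain in urlsplit(href).netloc:
            return href, text
    return None


def check_all(urls, fetch, workers=4):
    """Fetch urls concurrently and map each matching page to its link info."""
    found = {}  # stand-in for the ZODB root mapping
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, body in zip(urls, pool.map(fetch, urls)):
            match = find_link(body)
            if match:
                found[url] = {"target": match[0], "anchor": match[1]}
    return found


# Hypothetical canned pages, used instead of real HTTP requests.
PAGES = {
    "http://a.test/": '<p><a href="http://example.org/x">Example</a></p>',
    "http://b.test/": '<p><a href="/local">no match</a></p>',
}

results = check_all(list(PAGES), PAGES.get)
print(results)
```

Passing the fetch function in as a parameter keeps the link check testable without a network; a real run would pass a function that performs the HTTP request.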

When I test the script with 4 or 5 URLs it works well; the only output when the script finishes is this message:

Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored

But when I try to check about 24,000 URLs, it fails near the end of the list (with roughly 400 URLs left to check) and reports the following error:

Traceback (most recent call last):
  File "check.py", line 95, in <module>
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/requests/async.py", line 83, in map
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/gevent-1.0b2-py2.7-linux-x86_64.egg/gevent/greenlet.py", line 405, in joinall
ImportError: No module named queue
Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored

I tried both the version of gevent from PyPI and the latest version (1.0b2), downloaded and installed from the gevent repository.

I can't understand why this happens, or why it only happens when checking a large batch of URLs. Any suggestions?

Here is the whole script:

from requests import async, defaults
from lxml import html
from urlparse import urlsplit
from gevent import monkey
from BeautifulSoup import UnicodeDammit
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import random

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
monkey.patch_all()
defaults.defaults['base_headers']['User-Agent'] = "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
defaults.defaults['max_retries'] = 10


def save_data(source, target, anchor):
    root[source] = persistent.mapping.PersistentMapping(dict(target=target, anchor=anchor))
    transaction.commit()


def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        # UnicodeDecodeError takes five arguments, so report a plain ValueError.
        raise ValueError(
            "Failed to detect encoding, tried [%s]" % ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode


def find_link(html_doc, url):
    decoded = decode_html(html_doc)
    doc = html.document_fromstring(decoded.encode('utf-8'))
    for element, attribute, link, pos in doc.iterlinks():
        if attribute == "href" and link.startswith('http'):
            netloc = urlsplit(link).netloc
            if "example.org" in netloc:
                # (source page, link target, anchor text)
                return (url, link, element.text_content().strip())
    return False


def check(response):
    if response.status_code == 200:
        html_doc = response.content
        result = find_link(html_doc, response.url)
        if result:
            source, target, anchor = result
            # print "Source: %s" % source
            # print "Target: %s" % target
            # print "Anchor: %s" % anchor
            # print
            save_data(source, target, anchor)
    global todo
    todo = todo - 1
    print todo

def load_urls(fname):
    with open(fname) as fh:
        urls = set([url.strip() for url in fh.readlines()])
        urls = list(urls)
        random.shuffle(urls)
        return urls

if __name__ == "__main__":

    urls = load_urls('urls.txt')
    rs = []
    todo = len(urls)
    print "Ready to analyze %s pages" % len(urls)
    for url in urls:
        rs.append(async.get(url, hooks=dict(response=check), timeout=10.0))
    responses = async.map(rs, size=100)
    print "DONE."

Solution

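I'm not sure what the root cause of your problem is, but why isn't your monkey.patch_all() at the top of the file?

Could you try putting

from gevent import monkey; monkey.patch_all()

at the top of your main program and see whether it fixes anything?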
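Why the order might matter can be shown with a toy model. The `KeyError ... in <module 'threading'> ignored` message is commonly attributed to `threading` being imported (here possibly through one of the earlier imports) before `monkey.patch_all()` runs: the main thread is registered under its native id, and at exit it is looked up under the patched id. The sketch below only simulates that sequence with plain Python; it does not use gevent, the name `simulate` is hypothetical, and whether this also explains the `ImportError` is not established here.

```python
def simulate(patch_first):
    """Toy model of registering a main thread before or after patching."""
    # Current "thread id" function, stored in a dict so it can be swapped.
    ident = {"get": lambda: "native-thread-id"}

    def patch_all():
        # Stand-in for monkey.patch_all(): replace the id function in place.
        ident["get"] = lambda: "greenlet-id"

    if patch_first:
        patch_all()

    # "import threading": the main thread is recorded under its current id.
    active = {ident["get"](): "MainThread"}

    if not patch_first:
        patch_all()

    # Interpreter exit: the main thread is removed using the current id.
    try:
        del active[ident["get"]()]
        return "clean exit"
    except KeyError:
        return "KeyError at exit"


print(simulate(patch_first=True))
print(simulate(patch_first=False))
```

In the model, patching first keeps registration and lookup consistent, while patching late leaves an entry that can no longer be found, which mirrors the advice to call monkey.patch_all() before any other import.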

(Editor: 李大同)