Python multithreading with multiple queues (BeautifulSoup web crawler)

Published: 2020-12-17 17:13:06 | Category: Python | Source: compiled from the web

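    import queue
    import threading
    import time
    import urllib.request

    from bs4 import BeautifulSoup

    hosts = ["http://yahoo.com", "http://taobao.com", "http://apple.com", "http://ibm.com", "http://www.amazon.cn"]

    url_queue = queue.Queue()  # URLs waiting to be fetched
    out_queue = queue.Queue()  # raw page content waiting to be parsed

    class ThreadUrl(threading.Thread):
        """Fetches each URL taken from url_queue and puts the page body on out_queue."""

        def __init__(self, url_queue, out_queue):
            super().__init__()
            self.url_queue = url_queue
            self.out_queue = out_queue

        def run(self):
            while True:
                host = self.url_queue.get()
                try:
                    with urllib.request.urlopen(host, timeout=10) as response:
                        chunk = response.read()
                    self.out_queue.put(chunk)  # hand the page over to the parser threads
                except Exception as exc:
                    print("Failed to fetch %s: %s" % (host, exc))
                finally:
                    # Always mark the URL as processed; otherwise join() would block forever.
                    self.url_queue.task_done()

    class DatamineThread(threading.Thread):
        """Parses each page taken from out_queue and prints its <title> tags."""

        def __init__(self, out_queue):
            super().__init__()
            self.out_queue = out_queue

        def run(self):
            while True:
                chunk = self.out_queue.get()
                try:
                    soup = BeautifulSoup(chunk, "html.parser")
                    print(soup.find_all("title"))  # every <title> tag found in the page
                finally:
                    self.out_queue.task_done()

    def main():
        # Start five fetcher threads that move page content from url_queue to out_queue.
        for _ in range(5):
            t = ThreadUrl(url_queue, out_queue)
            t.daemon = True  # daemon threads do not keep the process alive at exit
            t.start()

        # Enqueue every URL to fetch.
        for host in hosts:
            url_queue.put(host)

        # Start five parser threads that extract the <title> tags.
        for _ in range(5):
            dt = DatamineThread(out_queue)
            dt.daemon = True
            dt.start()

        # Block until every URL has been fetched, then until every page has been parsed.
        url_queue.join()
        out_queue.join()

    start = time.time()
    main()
    print("Total time: %s" % (time.time() - start))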
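
The two-queue hand-off in the listing can also be tried without network access. The following is a minimal Python 3 sketch under stated assumptions, not the original author's code: `fake_fetch`, `extract_title`, and `run_pipeline` are hypothetical names introduced here, canned HTML stands in for the real HTTP request, and a naive string search stands in for BeautifulSoup.

```python
import queue
import threading


def fake_fetch(url):
    # Hypothetical stand-in for a network request: returns canned HTML.
    return "<html><head><title>%s</title></head></html>" % url


def extract_title(html):
    # Naive string search, used only to avoid a third-party parser here.
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return html[start:end]


def run_pipeline(urls, workers=3):
    url_queue = queue.Queue()   # stage 1 input: URLs to fetch
    page_queue = queue.Queue()  # stage 2 input: fetched pages to parse
    titles = []
    lock = threading.Lock()     # guards the shared result list

    def fetcher():
        while True:
            url = url_queue.get()
            try:
                page_queue.put(fake_fetch(url))
            finally:
                url_queue.task_done()

    def parser():
        while True:
            html = page_queue.get()
            try:
                with lock:
                    titles.append(extract_title(html))
            finally:
                page_queue.task_done()

    # Daemon workers block on get() forever and are discarded at process exit.
    for target in (fetcher, parser):
        for _ in range(workers):
            threading.Thread(target=target, daemon=True).start()

    for url in urls:
        url_queue.put(url)

    url_queue.join()   # every URL fetched, so every page is already queued
    page_queue.join()  # every queued page parsed
    return sorted(titles)


if __name__ == "__main__":
    print(run_pipeline(["http://a.example", "http://b.example"]))
```

Joining `url_queue` before `page_queue` matters: each fetcher puts its page on `page_queue` before calling `task_done()`, so once the first join returns, no further pages can arrive and the second join is a reliable completion signal.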

(Editor: 李大同)
