python – Web爬虫 – 以下链接

发布时间：2020-12-16 22:54:10 所属栏目：Python 来源：网络整理

导读：请多多包涵.我是Python的新手但有很多乐趣.我正在尝试编写一个网络爬虫代码,用于搜索丹麦最后一次公投的选举结果.我设法从主页面中提取所有相关链接.现在我希望Python遵循92个链接中的每一个,并从每个页面中收集9条信息.但我很困惑.希望你能给我一个提示.

请多多包涵.我是Python的新手 – 但有很多乐趣.我正在尝试编写一个网络爬虫代码,用于搜索丹麦最后一次公投的选举结果.我设法从主页面中提取所有相关链接.现在我希望Python遵循92个链接中的每一个,并从每个页面中收集9条信息.但我很困惑.希望你能给我一个提示.

这是我的代码：

import requests
import urllib2 
from bs4 import BeautifulSoup

# This is the original url http://www.kmdvalg.dk/

soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

my_list = []
all_links = soup.find_all("a")

for link in all_links:
    link2 = link["href"]
    my_list.append(link2)

for i in my_list[1:93]:
    print i

# The output shows all the links that I would like to follow and gather information from. How do I do that?

最佳答案

一个简单的方法是遍历您的URL列表并分别解析它们：

for url in my_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # then parse each page individually here

或者,您可以使用Futures显着加快速度.

from requests_futures.sessions import FuturesSession

def my_parse_function(html):
    """Use this function to parse each page"""
    soup = BeautifulSoup(html)
    all_paragraphs = soup.find_all('p')
    return all_paragraphs

session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]

page_results = [my_parse_function(future.result()) for future in results]

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!