python – 跟踪网站链接的重复过程(BeautifulSoup)

发布时间：2020-12-20 12:32:51 所属栏目：Python 来源：网络整理

导读：我正在使用 Python编写代码以使用Beautiful soup获取URL中的所有’a’标记,然后我使用位置3处的链接,然后我应该关注该链接,我将重复此过程大约18次.我包含了下面的代码,该代码重复了两次.我不能在循环中重复相同的过程18次.任何帮助将不胜感激. import reimp

我正在使用 Python编写代码以使用Beautiful soup获取URL中的所有’a’标记,然后我使用位置3处的链接,然后我应该关注该链接,我将重复此过程大约18次.我包含了下面的代码,该代码重复了两次.我不能在循环中重复相同的过程18次.任何帮助将不胜感激.

import re
import urllib

from BeautifulSoup import *
htm1= urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html ').read()
soup =BeautifulSoup(htm1)
tags = soup('a')
list1=list()
for tag in tags:
    x = tag.get('href',None)
    list1.append(x)

M= list1[2]

htm2= urllib.urlopen(M).read()
soup =BeautifulSoup(htm2)
tags1 = soup('a')
list2=list()
for tag1 in tags1:
    x2 = tag1.get('href',None)
    list2.append(x2)

y= list2[2]
print y

好的,我刚刚编写了这段代码,它正在运行,但我在结果中得到了相同的4个链接.看起来循环中有问题(请注意：我正在尝试循环4次)

import re
import urllib
from BeautifulSoup import *
list1=list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

for i in range (4):  # repeat 4 times
    htm2= urllib.urlopen(url).read()
    soup1=BeautifulSoup(htm2)
    tags1= soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href',None)
        list1.append(x2)
    y= list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url=y             
    print y

解决方法

I can’t come about a way to repeat the same process 18 times in a loop.

要在Python中重复18次,你可以在范围(18)循环中使用_：

#!/usr/bin/env python2
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup # $pip install beautifulsoup4

url = 'http://example.com'
for _ in range(18):  # repeat 18 times
    soup = BeautifulSoup(urlopen(url))
    a = soup.find_all('a',href=True)  # all <a href> links
    if len(a) < 3:  # no 3rd link
        break  # exit the loop
    url = urljoin(url,a[2]['href'])  # 3rd link,note: ignore <base href>

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!