
python – Beautiful Soup: open all URLs containing pid

Published: 2020-12-20 13:07:09 | Category: Python | Source: compiled from the web
I am trying to open every link that contains pid, but I end up in one of two situations:

>With this version it opens every URL (I mean even the junk URLs)

def get_links(self): 
    links = [] 
    host = urlparse( self.url ).hostname 
    scheme = urlparse( self.url ).scheme 
    domain_link = scheme+'://'+host 
    pattern = re.compile(r'(/pid/)')

    for a in self.soup.find_all(href=True):            
        href = a['href']
        if not href or len(href) <= 1:
            continue
        elif 'javascript:' in href.lower():
            continue
        elif 'forgotpassword' in href.lower():
            continue
        elif 'images' in href.lower():
            continue
        elif 'seller-account' in href.lower():
            continue
        elif 'review' in href.lower():
            continue
        else:
            href = href.strip()
        if href[0] == '/':
            href = (domain_link + href).strip()
        elif href[:4] == 'http':
            href = href.strip()
        elif href[0] != '/' and href[:4] != 'http':
            href = ( domain_link + '/' + href ).strip()                  
        if '#' in href:
            indx = href.index('#')
            href = href[:indx].strip()
        if href in links:
            continue

        links.append(self.re_encode(href))

    return links

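>With this version it only opens URLs that contain pid, but it does not follow links and never gets beyond the home page. It also crashes after opening a few pid links.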

def get_links(self): 
    links = [] 
    host = urlparse( self.url ).hostname 
    scheme = urlparse( self.url ).scheme 
    domain_link = scheme+'://'+host 
    pattern = re.compile(r'(/pid/)')

    for a in self.soup.find_all(href=True):
        if pattern.search(a['href']) is not None:
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href= href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = ( domain_link + '/' + href ).strip()                  
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue

            links.append(self.re_encode(href))

    return links

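Can someone help me crawl all links, including the internal links found inside those pages, and in the end return only the links that contain pid?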

Solution

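Maybe I am missing something, but why not use a plain if statement with a substring check instead of a regular expression? It would look like this: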

def get_links(self): 
    links = [] 
    host = urlparse( self.url ).hostname 
    scheme = urlparse( self.url ).scheme 
    domain_link = scheme+'://'+host 

    for a in self.soup.find_all(href=True):            
        href = a['href']
        if not href or len(href) <= 1:
            continue
        if href.lower().find("/pid/") != -1:
            if 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue

            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = ( domain_link + '/' + href ).strip()   

            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()

            if href in links:
                continue

            links.append(self.re_encode(href))

    return links
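
As a side note on the URL handling above, the manual checks for a leading '/' or 'http' and the '#' slicing could probably be replaced by the standard library. The following is a minimal sketch, not the code from the question: normalize_pid_link and SKIP_WORDS are hypothetical names, and the skip words are simply copied from the checks above. One behavioural difference to be aware of: urljoin resolves a relative href against the page URL (as a browser would), whereas the code above always prepends the domain root.

```python
from urllib.parse import urljoin, urldefrag

# Substrings that mark a link as unwanted (copied from the checks above).
SKIP_WORDS = ('javascript:', 'forgotpassword', 'images',
              'seller-account', 'review')


def normalize_pid_link(base_url, href):
    """Return an absolute, fragment-free URL for a /pid/ link, or None to skip it."""
    href = (href or '').strip()
    if len(href) <= 1:
        return None
    lowered = href.lower()
    if '/pid/' not in lowered:
        return None
    if any(word in lowered for word in SKIP_WORDS):
        return None
    # urljoin resolves relative hrefs against the page URL;
    # urldefrag drops everything from '#' onwards.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute
```

Inside get_links this could be called as normalize_pid_link(self.url, a['href']), appending the result to links only when it is not None and not already present.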

I also removed the following lines, because I believe that otherwise your code would never reach the lower part: every branch ends in continue.

else:
        continue
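
The question also asks for a crawl that follows every internal link but returns only the pid links. Filtering inside get_links cannot do both, because a link that is filtered out is never visited. One possible approach is to keep link extraction unfiltered and apply the pid test only when collecting results. The sketch below assumes this design; crawl_for_pid_links, fetch_links and max_pages are hypothetical names, and fetch_links stands in for whatever downloads a page and returns its absolute links (for example an unfiltered get_links). The source does not say why the original crashes, so the try/except and the page limit are only general precautions.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_for_pid_links(start_url, fetch_links, max_pages=100):
    """Follow every same-host link, but return only those containing /pid/.

    fetch_links is a hypothetical callable: given a URL, it returns the
    absolute links found on that page.
    """
    host = urlparse(start_url).hostname
    seen = {start_url}           # avoids visiting the same URL twice
    queue = deque([start_url])   # breadth-first frontier
    pid_links = []
    pages_fetched = 0

    while queue and pages_fetched < max_pages:
        url = queue.popleft()
        pages_fetched += 1
        try:
            links = fetch_links(url)
        except Exception:
            # One bad page should not stop the whole crawl.
            continue
        for link in links:
            if link in seen or urlparse(link).hostname != host:
                continue
            seen.add(link)
            queue.append(link)            # follow it whether or not it has /pid/
            if '/pid/' in link.lower():
                pid_links.append(link)    # but report only the pid links
    return pid_links
```

The seen set keeps the crawl from looping between pages that link to each other, and max_pages bounds the total work on large sites.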

(Editor: 李大同)
