[Regular Expressions] Matching multi-line web page data with a multi-line regular expression

Published: 2020-12-14 00:40:37  Category: Encyclopedia  Source: compiled from the web


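This article does not cover regular expression syntax in detail. One note on learning it: it is enough to understand what the metacharacters mean, the compile function and its flags, the top-level functions of the re module, and the methods of match objects.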
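
As a quick orientation, the four pieces listed above can be sketched in a few lines. The sample strings and the names port_re, ips and m are made up for illustration and are not from the original script.

```python
import re

# Compile function plus a compile flag: re.IGNORECASE makes "port" match "PORT".
# Metacharacters used: \s (whitespace), \d (digit), * and + (repetition), () (group).
port_re = re.compile(r'port\s*=\s*(\d+)', re.IGNORECASE)

# Top-level function of the re module: findall compiles and scans in one call.
ips = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', 'hosts: 66.104.77.20 and 10.0.0.1')

# Match object methods: group() returns captured text, span() returns offsets.
m = port_re.search('PORT = 3128')
print(ips)
print(m.group(1))
print(m.span())
```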

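Goal:

Extract the target data from the content fetched from a given page. Here the target data is the IP address and port of each proxy server.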

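Note: a multi-line regular expression is used here. It could of course be compiled with the re.X flag (which ignores unescaped whitespace in the pattern), but HTML is lenient about whitespace, so pages are often poorly aligned. To cope with this, the improvement over other regular expressions is to add [\s]* at the start and end of each line of the HTML being matched, so that insignificant whitespace is absorbed. One more reminder on this detail: the page content cannot be used directly as the regular expression, especially when the content to match spans multiple lines.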
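
The point about whitespace can be checked with a small sketch. The two HTML snippets, the class name "ip" and the variable names below are hypothetical; they only illustrate that a pattern pasted from the page matches one exact layout, while a pattern with [\s]* around each line tolerates different indentation and line endings. A re.X variant is included for comparison.

```python
import re

# Same markup, different whitespace (spaces vs. tabs, \n vs. \r\n).
html_a = '<span class="ip">\n    66.104.77.20\n</span>'
html_b = '<span class="ip">\r\n\t\t66.104.77.20   \r\n\t</span>'

# Page text pasted as the pattern (re.escape only makes the dots literal):
# the whitespace has to match character for character.
literal = re.compile(re.escape(html_a))

# One string literal per HTML line, each wrapped in [\s]*.
tolerant = re.compile(
    r'<span class="ip">[\s]*'
    r'[\s]*(.+?)[\s]*'
    r'[\s]*</span>'
)

# With re.X, whitespace in the pattern is ignored, so a literal space
# has to be written explicitly, here as [ ].
verbose = re.compile(r'''
    <span[ ]class="ip"> \s*
    (.+?) \s*
    </span>
''', re.X)

print(literal.search(html_b))      # the pasted pattern misses the second layout
print(tolerant.findall(html_a), tolerant.findall(html_b))
print(verbose.findall(html_b))
```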

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import urllib.request

'''
    The HTTP response body has the following format:
    <div class="proxylistitem" name="list_proxy_ip">
            <div style="float:left; display:block; width:630px;">
            <span class="tbBottomLine" style="width:140px;">
                66.104.77.20
            </span>
            <span class="tbBottomLine" style="width:50px;">
                    3128
            </span>
            <span class="tbBottomLine " style="width:70px;">
                高匿
            </span>
            <span class="tbBottomLine " style="width:70px;">
                美国
            </span>
            <span class="tbBottomLine " style="width:80px;">
                09月05日
            </span>
            <span class="tbBottomLine " style="width:80px;">
                2.70(61票)
            </span>
            <span class="tbBottomLine " style="width:60px;">
                2.70
            </span>
            <span class="tbBottomLine " style="width:30px;">
                10天
            </span>
            </div>
        </div>
    Target page:
    http://www.proxy360.cn/Region/America
'''


def get_proxy_from_cnproxy():
    # One raw string literal per line of HTML. Each line is wrapped in [\s]*
    # so that line breaks and indentation in the page are absorbed.
    # Group 1 captures the IP address, group 2 the port.
    reStr = (
        r'<span class="tbBottomLine" style="width:140px;">[\s]*'
        r'[\s]*(.+?)[\s]*'
        r'[\s]*</span>[\s]*'
        r'[\s]*<span class="tbBottomLine" style="width:50px;">[\s]*'
        r'[\s]*(.+?)[\s]*'
        r'[\s]*</span>'
    )
    req_reObj = re.compile(reStr)
    target = 'http://www.proxy360.cn/Region/America'
    seq_page = urllib.request.urlopen(target)
    # read() returns bytes; decoding as UTF-8 is an assumption about the page.
    seq_page_html = seq_page.read().decode('utf-8', 'replace')
    proxy_address = req_reObj.findall(seq_page_html)
    for address in proxy_address:
        # Each address is an (ip, port) tuple.
        print(address)


if __name__ == '__main__':
    get_proxy_from_cnproxy()
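
To try the pattern without network access, the following sketch runs it against a trimmed copy of the sample markup shown in the script. SAMPLE_HTML, PROXY_RE and extract_proxies are hypothetical names introduced for this example, and joining the result as "ip:port" is just one possible output format.

```python
import re

# Trimmed copy of the sample response: only the IP and port spans are kept.
SAMPLE_HTML = '''
<div class="proxylistitem" name="list_proxy_ip">
        <div style="float:left; display:block; width:630px;">
        <span class="tbBottomLine" style="width:140px;">
            66.104.77.20
        </span>
        <span class="tbBottomLine" style="width:50px;">
                3128
        </span>
        </div>
    </div>
'''

# One literal per HTML line, each wrapped in [\s]* to absorb layout whitespace.
PROXY_RE = re.compile(
    r'<span class="tbBottomLine" style="width:140px;">[\s]*'
    r'[\s]*(.+?)[\s]*'
    r'[\s]*</span>[\s]*'
    r'[\s]*<span class="tbBottomLine" style="width:50px;">[\s]*'
    r'[\s]*(.+?)[\s]*'
    r'[\s]*</span>'
)


def extract_proxies(html):
    """Return a list of 'ip:port' strings found in html."""
    return ['%s:%s' % (ip, port) for ip, port in PROXY_RE.findall(html)]


proxies = extract_proxies(SAMPLE_HTML)
print(proxies)
```

Because findall returns one tuple per match when the pattern has two groups, a page with many list items yields one "ip:port" entry per proxy.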

(Editor: 李大同)
