Regular Expression Basics

Published: 2020-12-13 21:56:58  Category: Encyclopedia  Source: compiled from the web

Regular expressions are a prerequisite for writing Python web crawlers, so it pays to get the basics down first. Let's begin.

# -*- coding: utf-8 -*-

import re

line = "helloworld123"
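# ^ anchors the match at the start of the string
# . matches any single character (except a newline)
# * repeats the preceding character zero or more times
# $ anchors the match at the end of the string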

regexStr = "^h.*3$"

if re.match(regexStr,line):
    print("yes")  # prints yes
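
To see what each of the four symbols contributes, the same pattern can be tried against a few variations of the input. This is only a small sketch; the helper name `check` and the `samples` list are made up for illustration.

```python
import re

regex_str = r"^h.*3$"

def check(text):
    """Return True when text starts with 'h' and ends with '3'."""
    return re.match(regex_str, text) is not None

# Hypothetical sample inputs, chosen to exercise each symbol
samples = ["helloworld123", "h3", "helloworld12", "xhelloworld123"]
results = {s: check(s) for s in samples}

for text, ok in results.items():
    print(text, ok)
```

"h3" passes because .* may match zero characters; the last two fail on the $ and ^ anchors respectively. Since re.match always starts at the beginning of the string, the ^ here is redundant but harmless.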

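? makes a quantifier non-greedy

First, let's look at what greedy matching is. The regex engine always scans from left to right, but quantifiers such as * are greedy by default: they consume as many characters as possible and give some back only when the rest of the pattern cannot match otherwise. The result can look as if matching started from the end of the string.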

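# Extract the substring that runs from one h to another h
line = "heeeeeeeellohhaa"
regexStr = ".*(h.*h).*"  # the parentheses mark the part we want to extract
match_obj = re.match(regexStr,line)

if match_obj:
    print(match_obj.group(1))  # prints hh: the greedy leading .* takes as much as it can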

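As the output shows, the greedy leading .* pushes the captured group as far to the right as possible. Appending ? to a quantifier makes it non-greedy (lazy): it matches as few characters as possible and stops at the first position where the rest of the pattern can match, instead of the last one.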

line = "heeeeeeeellohhaa"
regexStr = ".*?(h.*?h).*"  # the ? after each * makes it non-greedy: stop at the first possible match
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # prints heeeeeeeelloh
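
Putting the two patterns side by side makes the difference easier to see. The sketch below reuses the same input and also prints where the captured group starts and ends; the variable names `greedy` and `lazy` are chosen here for illustration.

```python
import re

line = "heeeeeeeellohhaa"

greedy = re.match(r".*(h.*h).*", line)    # greedy quantifiers
lazy = re.match(r".*?(h.*?h).*", line)    # non-greedy quantifiers

# span(1) returns the (start, end) indexes of the captured group
print(greedy.group(1), greedy.span(1))
print(lazy.group(1), lazy.span(1))
```

In the greedy case the group starts near the end of the string, because the leading .* has already consumed everything before it. In the lazy case the leading .*? matches nothing, so the group starts at index 0 and ends at the next h.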

+ means the preceding character occurs at least once

line = "heeeeeeeellohhhdhaa"
regexStr = ".*(h.+h).*"  # the parentheses mark the part we want to extract
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # prints hdh: the greedy leading .* leaves the rightmost possible match
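
The only difference between + and * is whether zero occurrences are allowed. A short sketch using re.fullmatch, which requires the whole string to match (the names `star` and `plus` are illustrative):

```python
import re

star = re.compile(r"h.*h")  # zero or more characters between the two h's
plus = re.compile(r"h.+h")  # at least one character between the two h's

print(bool(star.fullmatch("hh")))   # True
print(bool(plus.fullmatch("hh")))   # False
print(bool(plus.fullmatch("hdh")))  # True
```

This is why the example above prints hdh rather than hh: with .+ there must be at least one character between the two h's.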

{n} sets how many times the preceding character repeats

line = "heeeeeeeellohhhdhaa"
#regexStr = ".*(h.{3}h).*"  # {3}: exactly 3 characters between the two h's
#regexStr = ".*(h.{4,}h).*"  # {4,}: 4 or more characters between the two h's
regexStr = ".*(h.{2,4}h).*"  # {2,4}: at least 2 and at most 4 characters between the two h's
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # prints hhdh
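
The three forms of counted repetition can be compared directly on short inputs. This is a minimal sketch; `fits` is a hypothetical helper defined only for this example.

```python
import re

def fits(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly_three = r"h.{3}h"   # exactly 3 characters in between
four_or_more = r"h.{4,}h"   # 4 or more characters in between
two_to_four = r"h.{2,4}h"   # between 2 and 4 characters in between

print(fits(exactly_three, "habch"))   # True
print(fits(exactly_three, "habh"))    # False
print(fits(four_or_more, "habcdh"))   # True
print(fits(four_or_more, "habch"))    # False
print(fits(two_to_four, "habh"))      # True
print(fits(two_to_four, "habcdeh"))   # False
```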

| means "or" (alternation)

line = "helloworld123"
regexStr = "(helloworld123|hello)"
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # helloworld123
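
Alternatives are tried from left to right and the first one that succeeds wins, so their order matters. A short sketch with the same input (the variable names are illustrative):

```python
import re

line = "helloworld123"

long_first = re.match(r"(helloworld123|hello)", line)
short_first = re.match(r"(hello|helloworld123)", line)

print(long_first.group(1))   # helloworld123
print(short_first.group(1))  # hello
```

When one alternative is a prefix of another, putting the longer one first is usually what you want.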

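[] matches any one of the characters inside the brackets

Note that most regex metacharacters, such as . and *, lose their special meaning inside []. The exceptions are ^ at the very start (which negates the set), - between two characters (which forms a range), the closing ] and the backslash.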

line = "18700987865"
regexStr = "(1[48357][0-9]{9})"
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # 18700987865
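
The behavior of characters inside [] can be checked with a few small calls. This is a sketch only; the name `phone` and the second phone number are made up for illustration.

```python
import re

# Inside [], '.' and '*' are plain characters, not metacharacters
print(re.findall(r"[.*]", "a.b*c"))    # ['.', '*']

# A leading '^' negates the set; '-' between two characters forms a range
print(re.findall(r"[^0-9]", "a1b2"))   # ['a', 'b']

# The mobile-number pattern: 1, then one of 4/8/3/5/7, then nine digits
phone = re.compile(r"1[48357][0-9]{9}")
print(bool(phone.fullmatch("18700987865")))  # True
print(bool(phone.fullmatch("12700987865")))  # False, 2 is not in the set
```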

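\s matches one whitespace character (space, tab, newline and so on)

\S matches one non-whitespace character

line = "hello world"
regexStr = r"(hello\sworld)"  # raw string, so the backslash reaches the regex engine intact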
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # hello world
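
A few more inputs show that \s covers more than the space character, and that \S is its complement. This is a minimal sketch with made-up sample strings.

```python
import re

pattern = re.compile(r"hello\sworld")

print(bool(pattern.match("hello world")))   # True, a space
print(bool(pattern.match("hello\tworld")))  # True, a tab is whitespace too
print(bool(pattern.match("helloworld")))    # False, \s needs one character

# \S+ collects runs of non-whitespace characters
print(re.findall(r"\S+", "hello  world\n"))  # ['hello', 'world']
```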

[\u4E00-\u9FA5] matches one Chinese character

\d matches one digit

line = "xxx出生于1992年"
regexStr = r".*?(\d+)年"  # raw string, so \d is passed to the regex engine as written
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # 1992
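
Both classes can be tried on the same sample string with re.findall, which returns every non-overlapping match. A short sketch:

```python
import re

line = "xxx出生于1992年"

digits = re.findall(r"\d+", line)                # runs of digits
chinese = re.findall(r"[\u4E00-\u9FA5]+", line)  # runs of Chinese characters

print(digits)   # ['1992']
print(chinese)  # ['出生于', '年']
```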

Matching a birth date

line = "xxx出生于1992年3月22日"
# line = "xxx出生于1992/3/22"
# line = "xxx出生于1992-3-22"
# line = "xxx出生于1992-03-22"
# line = "xxx出生于1992-03"
regexStr = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
match_obj = re.match(regexStr,line)
if match_obj:
    print(match_obj.group(1))  # prints 1992年3月22
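
To check that one pattern really covers all of the formats listed above, the sample lines can be run through it in a loop. This is a sketch assuming the backslashes in the pattern are written out as \d; the list name `lines` and the result list `dates` are chosen here for illustration.

```python
import re

regex_str = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"

lines = [
    "xxx出生于1992年3月22日",
    "xxx出生于1992/3/22",
    "xxx出生于1992-3-22",
    "xxx出生于1992-03-22",
    "xxx出生于1992-03",
]

dates = []
for line in lines:
    match_obj = re.match(regex_str, line)
    # group(1) is the whole date; None marks a line that did not match
    dates.append(match_obj.group(1) if match_obj else None)
    print(line, "->", dates[-1])
```

The last sample has no day part, so it is matched by the final $ alternative inside the inner group.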

(Editor: 李大同)