
Python: regular expression not working as expected

Published: 2020-12-20 11:49:09 | Category: Python | Source: compiled from the web
I am using the following regular expression, which is supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what is wrong?

#INPUT
import re

text = 'That U.S.A. poster-print costs $12.40...'

print re.findall(r'([A-Z]\.)+', text)

#OUTPUT
['A.']

Expected output:

['U.S.A.']

I am following the NLTK Book, chapter 3.7 here, which provides a set of regular expressions, but it just doesn't work. I have tried it in both Python 2.7 and 3.4.
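The root cause is standard `re` behavior rather than a broken Python install: when a capturing group is repeated with `+`, only the text of the last repetition is kept, and `re.findall` returns group captures instead of whole matches. A minimal demonstration:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# A repeated capturing group keeps only its LAST repetition, and
# re.findall returns group captures rather than whole matches,
# so the capturing version yields just the final 'A.' chunk.
print(re.findall(r'([A-Z]\.)+', text))    # ['A.']

# Making the group non-capturing returns the full match instead.
print(re.findall(r'(?:[A-Z]\.)+', text))  # ['U.S.A.']
```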

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() should work the same way as re.findall(), and I think my Python somehow fails to recognize the regex as expected. The regex listed above outputs:

[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
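The tuple output comes from how `re.findall` handles capturing groups: with more than one group in the pattern, it returns one tuple of group captures per match, not the matched text. A minimal reproduction with the capturing version of the book's pattern:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Three capturing groups, so findall returns a 3-tuple of group
# captures for each match instead of the match itself.
pattern = r'''(?x)
      ([A-Z]\.)+          # group 1
    | \w+(-\w+)*          # group 2
    | \$?\d+(\.\d+)?%?    # group 3
    | \.\.\.
    | [][.,;"'?():-_`]
'''
print(re.findall(pattern, text))
# [('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
```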

Solution

Possibly, this is related to how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was removed in v3.1 (see here).
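Before v3.1, NLTK rewrote capturing groups in the tokenizer pattern into non-capturing ones before compiling. As a rough illustration of that idea (a hypothetical helper, not NLTK's actual implementation):

```python
import re

def to_noncapturing(pattern):
    # Hypothetical sketch: turn each unescaped capturing "(" into "(?:",
    # leaving "\(" and existing "(?..." constructs untouched.
    # (Not NLTK's real code; real-world escaping has more edge cases.)
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

print(to_noncapturing(r'([A-Z]\.)+'))  # (?:[A-Z]\.)+
```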

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> 
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg,+/- 10 grams."
>>> tokeniser.tokenize(line)
['My','weight','is','about','68','kg','+','/','-','10','grams']

But it does not work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg,+/- 10 grams."
>>> tokeniser.tokenize(line)
[('','')]

With a slight modification to how your regex groups are defined (making them non-capturing), you can use the same pattern in NLTK v3.1:

pattern = r"""(?x)                   # set flag to allow verbose regexps
              (?:[A-Z].)+           # abbreviations,e.g. U.S.A.
              |d+(?:.d+)?%?       # numbers,incl. currency and percentages
              |w+(?:[-']w+)*       # words w/ optional internal hyphens/apostrophe
              |(?:[+/-@&*])         # special characters with meanings
            """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                   # set flag to allow verbose regexps
... (?:[A-Z].)+           # abbreviations,e.g. U.S.A.
... |d+(?:.d+)?%?       # numbers,incl. currency and percentages
... |w+(?:[-']w+)*       # words w/ optional internal hyphens/apostrophe
... |(?:[+/-@&*])         # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg,'grams']

Without NLTK, using Python's re module directly, we see that the old (capturing) regex pattern is not supported natively:

>>> pattern1 = r"""(?x)               # set flag to allow verbose regexps
...               ([A-Z].)+         # abbreviations,e.g. U.S.A.
...               |$?d+(.d+)?%? # numbers,incl. currency and percentages
...               |w+([-']w+)*    # words w/ optional internal hyphens/apostrophe
...               |[+/-@&*]        # special characters with meanings
...               |Sw*                       # any sequence of word characters# 
... """            
>>> text="My weight is about 68 kg,+/- 10 grams."
>>> re.findall(pattern1,text)
[('','')]
>>> pattern2 = r"""(?x)                   # set flag to allow verbose regexps
...                       (?:[A-Z].)+           # abbreviations,e.g. U.S.A.
...                       |d+(?:.d+)?%?       # numbers,incl. currency and percentages
...                       |w+(?:[-']w+)*       # words w/ optional internal hyphens/apostrophe
...                       |(?:[+/-@&*])         # special characters with meanings
...                     """
>>> text="My weight is about 68 kg,+/- 10 grams."
>>> re.findall(pattern2,text)
['My','grams']

Note: The change in how NLTK's RegexpTokenizer compiles regexes makes the example in NLTK's Regular Expression Tokenizer documentation obsolete.
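Putting it together without NLTK: a quick check, using only the standard library, that a non-capturing version of the NLTK book's full pattern tokenizes the original sentence correctly:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Non-capturing version of the NLTK book's tokenizer pattern.
pattern = r'''(?x)        # verbose mode
      (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.              # ellipsis
    | [][.,;"'?():-_`]    # punctuation as separate tokens
'''

print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```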
