Python: regular expression not working as expected
Published: 2020-12-20 11:49:09 · Category: Python · Source: compiled from the web
I am using the following regular expression, which is supposed to find the string 'U.S.A.', but it only gets 'A.' — does anyone know what is wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print re.findall(r'([A-Z]\.)+', text)
#OUTPUT
['A.']

Expected output:

['U.S.A.']

I am following the NLTK Book, chapter 3.7 (here); it has this set of regular expressions, but it just does not work. I have tried it in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)      # set flag to allow verbose regexps
...     ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*          # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.              # ellipsis
...   | [][.,;"'?():-_`]    # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() works the same way as re.findall(), and I think my Python somehow fails to recognize the regexps as expected. With the regexp listed above it outputs:

[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]

Solution:
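The core problem can be reproduced without NLTK at all. The following sketch (plain `re`, Python 3 syntax) shows why `findall` with a capturing group returns only the last repetition of the group, and two ways to get the full match instead:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With a capturing group, re.findall returns what the GROUP captured.
# A repeated group (...)+ keeps only its LAST repetition, hence 'A.'.
print(re.findall(r'([A-Z]\.)+', text))        # ['A.']

# A non-capturing group (?:...) makes findall return the whole match.
print(re.findall(r'(?:[A-Z]\.)+', text))      # ['U.S.A.']

# Or keep the capturing group but take the full match via finditer/group(0).
print([m.group(0) for m in re.finditer(r'([A-Z]\.)+', text)])  # ['U.S.A.']
```

This is documented `re` behavior, not a Python bug: when the pattern contains groups, `findall` returns the group captures rather than the whole match.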
Possibly, this has to do with NLTK previously compiling regexps with nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
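For intuition only, the abolished helper's job was to rewrite plain capturing groups into non-capturing ones before compiling the tokenizer pattern. Here is a naive sketch of that idea — this is NOT the real nltk.internals.compile_regexp_to_noncapturing implementation, and it ignores corner cases such as parentheses inside character classes:

```python
import re

def to_noncapturing(pattern):
    """Naively rewrite each unescaped '(' that does not start a '(?...)'
    construct into a non-capturing '(?:'."""
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

print(to_noncapturing(r'([A-Z]\.)+'))  # (?:[A-Z]\.)+
print(re.findall(to_noncapturing(r'([A-Z]\.)+'),
                 'That U.S.A. poster-print costs $12.40...'))  # ['U.S.A.']
```

With such a rewrite in place (as in NLTK <= 3.0.x), the book's capturing-group patterns tokenized correctly; once it was removed in v3.1, the groups were passed straight to `re.findall` and broke.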
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \$?\d+(\.\d+)?%?        # numbers, incl. currency and percentages
...   | \w+([-']\w+)*           # words w/ optional internal hyphens/apostrophe
...   | [+/\-@&*]               # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it does not work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \$?\d+(\.\d+)?%?        # numbers, incl. currency and percentages
...   | \w+([-']\w+)*           # words w/ optional internal hyphens/apostrophe
...   | [+/\-@&*]               # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ...]

Slightly modifying how your regexp groups are defined, you can get the same pattern working in NLTK v3.1 with this regexp:

pattern = r"""(?x)              # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
        |\$?\d+(?:\.\d+)?%?     # numbers, incl. currency and percentages
        |\w+(?:[-']\w+)*        # words w/ optional internal hyphens/apostrophe
        |(?:[+/\-@&*])          # special characters with meanings
        """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                        # set flag to allow verbose regexps
... (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
... |\$?\d+(?:\.\d+)?%?         # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*            # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])              # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module, we can see that the old regexp pattern is not natively supported:

>>> pattern1 = r"""(?x)         # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...     |\$?\d+(\.\d+)?%?       # numbers, incl. currency and percentages
...     |\w+([-']\w+)*          # words w/ optional internal hyphens/apostrophe
...     |[+/\-@&*]              # special characters with meanings
...     |\S\w*                  # any sequence of word characters
...     """
>>> text = "My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ...]
>>> pattern2 = r"""(?x)         # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...     |\$?\d+(?:\.\d+)?%?     # numbers, incl. currency and percentages
...     |\w+(?:[-']\w+)*        # words w/ optional internal hyphens/apostrophe
...     |(?:[+/\-@&*])          # special characters with meanings
...     """
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Note: the change in how NLTK's RegexpTokenizer compiles its regexps makes the examples in NLTK's Regular Expression Tokenizer documentation obsolete as well.
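The same non-capturing fix also makes the original NLTK-book example from the question work directly with `re.findall` on any modern Python, no NLTK required. A minimal sketch (the ellipsis alternative is restored from the book's pattern; punctuation other than `...` is left unmatched here):

```python
import re

# Non-capturing version of the NLTK-book tokenizer pattern.
pattern = r"""(?x)           # verbose mode: whitespace and comments ignored
    (?:[A-Z]\.)+             # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?%?       # currency and percentages, e.g. $12.40, 82%
  | \w+(?:[-']\w+)*          # words with optional internal hyphens/apostrophes
  | \.\.\.                   # ellipsis
"""

text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Because every group is non-capturing, `findall` returns the whole match for each token, which is exactly what a regexp tokenizer needs.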