Splitting words in Python with a regular expression

Published: 2020-12-14 06:28:06 | Category: Encyclopedia | Source: compiled from the web
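I am designing a regular expression to split out all the actual words from a given text:

Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with this regular expression: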

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

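After splitting in Python, the result contains None items and empty spaces.

How do I get rid of the None items? And why are the spaces not matched?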
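
A short sketch of where the None items come from: `re.split` returns the text of every capturing group in the pattern along with the pieces, and a group that took no part in a given match shows up as None. The rewrite below keeps the same alternatives but uses non-capturing groups; the names `capturing`, `non_capturing` and the final filter are illustrative choices, not part of the question.

```python
import re

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Pattern from the question: every (...) is a capturing group, so re.split
# adds one entry per group for each separator it finds.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
with_groups = re.split(capturing, text)
print(None in with_groups)  # groups that did not participate appear as None

# Same alternatives, written with non-capturing groups (?:...),
# so only the text between separators is returned.
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"
words = [w for w in re.split(non_capturing, text) if w]  # drop empty strings
print(words)
```

The captured separators are also why whitespace appears in the original result: they are returned as list items rather than discarded.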

Edit:

Splitting on spaces gives items like: ["there."]

Splitting on non-letters gives items like: ["John", "s"]

Splitting on non-letters other than ' gives items like: ["'Where", "you'"]
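
The three attempts above can be reproduced with a few lines. This is only a sketch; the variable names are made up for illustration.

```python
import re

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1. Split on whitespace: punctuation stays glued to the words.
by_space = text.split()

# 2. Split on runs of non-letters: apostrophes break words apart.
by_non_letter = re.split(r"[^a-zA-Z]+", text)

# 3. Split on runs of non-letters other than the apostrophe:
#    quoting apostrophes stay attached to the words.
by_non_letter_keep_apostrophe = re.split(r"[^a-zA-Z']+", text)

print(by_space)
print(by_non_letter)
print(by_non_letter_keep_apostrophe)
```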

You can use string functions instead of a regular expression:
to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

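However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. String operations fail at that point, and you need a finely tuned regular expression.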
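
To make the dilemma concrete, here is a small sketch. The helper `strip_chars` is a hypothetical name wrapping the loop above. Leaving the apostrophe out of the removal set keeps the quote marks, and adding it damages the contractions.

```python
def strip_chars(text, chars):
    """Delete every character in `chars` from `text`, then split on whitespace."""
    for c in chars:
        text = text.replace(c, "")
    return text.split()

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

keep_apostrophes = strip_chars(s, ".,:!")   # quote marks survive: "'Where", "you'"
drop_apostrophes = strip_chars(s, ".,:!'")  # contractions break: "Johns", "wasnt"

print(keep_apostrophes)
print(drop_apostrophes)
```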

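Edit: a simple regular expression can probably solve your problem:

(\w[\w']*)

It captures every run that starts with a word character and keeps capturing while the next character is an apostrophe or a word character. Note that \w matches letters, digits, and the underscore, not letters alone.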

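(\w[\w']*\w)

This second regular expression is for a very specific situation. The first one can capture words with a trailing apostrophe, like you'. The second avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). At that point, though, another case comes up: the second regular expression cannot capture the apostrophe in Moss' mom. You must decide whether to capture the trailing apostrophe in names that end in s and mark possession.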
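
A quick side-by-side of the two patterns on exactly those cases; the names `loose` and `strict` are only labels for this sketch.

```python
import re

loose = re.compile(r"\w[\w']*")      # may end with an apostrophe
strict = re.compile(r"\w[\w']*\w")   # must start and end with a word character

quoted = "'Where are you'"
possessive = "Moss' mom"

print(loose.findall(quoted))       # trailing quote is kept on the last word
print(strict.findall(quoted))      # trailing quote is dropped
print(loose.findall(possessive))   # possessive apostrophe is kept
print(strict.findall(possessive))  # possessive apostrophe is lost
```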

Example:

import re

rgx = re.compile(r"(\w[\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

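Update 2: I found a bug in my regular expression! It cannot capture a single letter followed by an apostrophe, like A', because the pattern needs at least two word characters. Here is the fixed regular expression:

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']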
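
If this is going into a script, the final pattern can be wrapped in a small helper. The sketch below uses hypothetical names (`split_words`, `WORD_RE`, `LETTERS_ONLY_RE`). The letters-only variant is an assumption about what "actual words" should mean, since \w also accepts digits and underscores.

```python
import re

# Final pattern from the answer (written without the outer group,
# so findall returns the whole match).
WORD_RE = re.compile(r"\w[\w']*\w|\w")

# Assumed variant: ASCII letters only, apostrophes allowed inside the word.
LETTERS_ONLY_RE = re.compile(r"[A-Za-z](?:[A-Za-z']*[A-Za-z])?")

def split_words(text, pattern=WORD_RE):
    """Return all words in `text` matched by `pattern`, in order."""
    return pattern.findall(text)

print(split_words("'A a'"))
print(split_words("room_101 isn't big"))
print(split_words("room_101 isn't big", pattern=LETTERS_ONLY_RE))
```

Which variant fits depends on whether tokens such as room_101 should count as words in your text.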

(Editor: 李大同)
