Python: the best code for finding the five words before and after a given point in a line

Published: 2020-12-20 11:18:09 | Category: Python | Source: compiled from the web

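I'm trying to write code that finds the 5 words on either side of a particular phrase. Easy enough, but I have to do this over a massive volume of data, so the code needs to be optimal!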

for file in listing:
    # Close each corpus file automatically once it has been processed.
    with open('//home/user/Documents/Corpus/Files/' + file, 'r') as file2:
        for line in file2:
            linetrigrams = trigram_split(line)
            for trigram in linetrigrams:
                if trigram in trigrams:
                    # Cut the line at the trigram, then take 5 words from each side.
                    line2 = line.replace(trigram, '###').split('###')
                    window = line2[0].split()[-5:] + line2[1].split()[:5]
                    for item in window:
                        if item in mostfreq:
                            matrix[trigram][mostfreq[item]] += 1

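Any suggestions for doing this faster? I may well be using completely the wrong data structures here. trigram_split() just returns all the trigrams in a line (these are the units I need to build vectors for). trigrams is basically a list of about a million trigrams for which I want to create vectors. window holds the 5 words before and after a trigram (if that trigram is in the list), and each of those words is then checked against mostfreq, a dictionary with 1000 words as keys, each mapped to an integer [0-100] as its stored value. That integer is used to update matrix, a dictionary whose stored values are lists ([0] * 1000); the corresponding entry in this pseudo-matrix is incremented.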
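To make the data structures described above concrete, here is a minimal sketch of the same counting loop working on word indices. It rests on assumptions that are not in the question: `count_context` and its `n_columns` parameter are hypothetical names, and a trigram is assumed to be three consecutive whitespace-separated words joined by single spaces, which may not match what `trigram_split()` really returns. If `trigrams` is a plain list of about a million items, every `trigram in trigrams` test scans that list, so the sketch converts it to a `set` first.

```python
from collections import defaultdict

def count_context(lines, trigrams, mostfreq, n_columns=1000, width=5):
    """Count frequent words within `width` words of each known trigram.

    Hypothetical helper: assumes a trigram is three consecutive
    whitespace-separated words joined by single spaces.
    """
    trigram_set = set(trigrams)  # average O(1) membership instead of a list scan
    matrix = defaultdict(lambda: [0] * n_columns)
    for line in lines:
        words = line.split()  # tokenise the line once
        for i in range(len(words) - 2):
            trigram = " ".join(words[i:i + 3])
            if trigram not in trigram_set:
                continue
            # Up to `width` words before and after the trigram itself.
            window = words[max(0, i - width):i] + words[i + 3:i + 3 + width]
            row = matrix[trigram]
            for word in window:
                column = mostfreq.get(word)
                if column is not None:
                    row[column] += 1
    return matrix

# Tiny usage example with made-up data.
example = count_context(
    ["the quick brown fox jumps over the lazy dog"],
    ["brown fox jumps"],
    {"quick": 0, "lazy": 1},
    n_columns=2,
)
```

Unlike the `replace`/`split` version, this counts a window for every occurrence of a trigram in a line, so the resulting numbers can differ when a trigram repeats.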

Solution

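Some important factors to consider when weighing up the various approaches:

>Multi-line vs. single-line
>Length of the lines
>Length of the search pattern
>Search match rate
>What to do if there are fewer than 5 words before/after
>How to handle non-word, non-space characters (newlines and punctuation)
>Case-insensitive?
>What to do about overlapping matches? E.g., if the text is We are the knights who say NI! NI NI NI NI NI NI NI NI and you search for NI, what do you get back? Will this happen to you?
>What if ### is in your data?
>Would you rather miss some, or return erroneous results? There may be some trade-offs, especially with messy real-world data.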
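A few of these questions can be explored with a small experiment. The sketch below is one possible way to handle repeated matches and short contexts, not a definitive implementation: `context_windows` is a hypothetical helper that walks over every non-overlapping, case-sensitive occurrence of the phrase with `re.finditer` and slices the text around each one, so no `###` sentinel is needed.

```python
import re

def context_windows(text, phrase, width=5):
    """Yield (before, after) word lists for each non-overlapping match."""
    for match in re.finditer(re.escape(phrase), text):
        # Slicing may return fewer than `width` words near either end.
        before = text[:match.start()].split()[-width:]
        after = text[match.end():].split()[:width]
        yield before, after

windows = list(context_windows("We are the knights who say NI! NI NI NI", "NI"))
print(len(windows))   # 4
print(windows[0])     # (['are', 'the', 'knights', 'who', 'say'], ['!', 'NI', 'NI', 'NI'])
print(windows[-1])    # (['who', 'say', 'NI!', 'NI', 'NI'], [])
```

Each occurrence gets its own window, punctuation stays attached to whichever token it touches, and the last match simply has an empty "after" list. Re-slicing the text for every match is wasteful on long inputs, so treat this as a way to check behaviour and not as the fast path.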

You could try a regular expression...

import re
zen = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
# Capture up to 5 whitespace-delimited tokens on each side of the search term.
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % re.escape(searchvar), re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
# [("obvious at first unless you're ", 'Dutch', '.\nNow is better than ')]

An alternative approach, which gives worse results IMO...

def splitAndFind(text, phrase):
    # Returns None when the phrase is absent; breaks if the text itself contains "###".
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])
print(splitAndFind(zen, 'Dutch'))
# (['obvious', 'at', 'first', 'unless', "you're"],
#  ['.', 'Now', 'is', 'better', 'than'])

In IPython, you can easily time these:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
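Outside IPython, similar measurements can be taken with the standard-library `timeit` module. The sketch below is only an illustration: `time_call` is a hypothetical helper, the sample text is made up, and the numbers it prints depend on the machine, so none are shown here.

```python
import timeit

# Made-up sample text; substitute real corpus lines for meaningful numbers.
text = "Although that way may not be obvious at first unless you're Dutch. " * 20

def time_call(func, number=10000):
    """Return the best average time per call, in seconds, over 3 repeats."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number

in_time = time_call(lambda: "Dutch" in text)
find_time = time_call(lambda: text.find("Dutch"))
print("in:   %.3g s per call" % in_time)
print("find: %.3g s per call" % find_time)
```

Taking the minimum of several repeats mirrors the "best of 3" figure that IPython reports.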

(Editor: 李大同)
