Python – best code for finding the five words before and after a given point in a line

for file in listing:
    # Open each corpus file and make sure it is closed again afterwards.
    with open('//home/user/Documents/Corpus/Files/' + file, 'r') as file2:
        for line in file2:
            linetrigrams = trigram_split(line)
            for trigram in linetrigrams:
                if trigram in trigrams:
                    # Split the line around the trigram and keep 5 words on each side.
                    line2 = line.replace(trigram, '###').split('###')
                    window = line2[0].split()[-5:] + line2[1].split()[:5]
                    for item in window:
                        if item in mostfreq:
                            matrix[trigram][mostfreq[item]] += 1

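Any suggestions for how to do this faster? I may well be using entirely the wrong data structures here. trigram_split() just returns all the trigrams in a line (these are the units I need to create vectors for). trigrams is basically a list of about a million trigrams that I care about creating vectors for. window gets the 5 words before and after the trigram (if that trigram is in the list), and then checks whether they are in mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as its stored value). This is then used to update matrix (a dictionary with lists ([0] * 1000) as stored values). The corresponding value in this pseudo-matrix is incremented in this way.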
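One direction that may help, sketched under assumptions rather than offered as the definitive fix: membership tests against a list of a million trigrams scan the whole list each time, while a set gives average constant-time lookups. Since trigram_split() is not shown, the sketch assumes word trigrams and builds them from token positions, which also avoids the replace/split step. The names build_matrix, n_columns and width are hypothetical. Note that it counts a window for every occurrence of a trigram, which is not identical to the replace/split code above.

```python
from collections import defaultdict


def build_matrix(lines, trigrams, mostfreq, n_columns, width=5):
    """Count context words around each known trigram (assumes word trigrams)."""
    trigram_set = set(trigrams)  # set lookup instead of scanning a list
    matrix = defaultdict(lambda: [0] * n_columns)
    for line in lines:
        words = line.split()
        for i in range(len(words) - 2):
            trigram = " ".join(words[i:i + 3])
            if trigram not in trigram_set:
                continue
            # Up to `width` words on each side, taken by position.
            window = words[max(0, i - width):i] + words[i + 3:i + 3 + width]
            row = matrix[trigram]
            for word in window:
                col = mostfreq.get(word)  # one dict lookup per word
                if col is not None:
                    row[col] += 1
    return matrix
```

Any iterable of lines works here, including an open file object, so the per-file loop from the question can stay as it is.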



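> How do you want to handle overlapping matches? For example, if the text is "We are the knights who say NI! NI NI NI NI NI NI NI NI" and you search for NI, what should you get back? Could this happen in your data?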
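To make the overlap question concrete, here is a small sketch (not taken from the question) that returns one window per occurrence by working on token positions. It assumes exact token matching after a plain split(), so "NI!" with attached punctuation does not count as "NI". The function name windows and the width parameter are hypothetical.

```python
def windows(text, word, width=5):
    """Return a (before, after) pair of word lists for every occurrence of word."""
    tokens = text.split()
    return [
        (tokens[max(0, i - width):i], tokens[i + 1:i + 1 + width])
        for i, tok in enumerate(tokens)
        if tok == word
    ]


text = "We are the knights who say NI! NI NI NI"
result = windows(text, "NI")
# Three occurrences match; their windows overlap and contain other matches.
```

Whether neighbouring matches should appear inside each other's windows, as they do here, depends on what the vectors are meant to capture.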


import re
zen = """Beautiful is better than ugly. 
Explicit is better than implicit. 
Simple is better than complex. 
Complex is better than complicated. 
Flat is better than nested. 
Sparse is better than dense. 
Readability counts. 
Special cases aren't special enough to break the rules. 
Although practicality beats purity. 
Errors should never pass silently. 
Unless explicitly silenced. 
In the face of ambiguity, refuse the temptation to guess. 
There should be one-- and preferably only one --obvious way to do it. 
Although that way may not be obvious at first unless you're Dutch. 
Now is better than never. 
Although never is often better than *right* now. 
If the implementation is hard to explain, it's a bad idea. 
If the implementation is easy to explain, it may be a good idea. 
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
# Capture up to five whitespace-separated tokens on each side of the search term.
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % searchvar, re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

An alternative approach, which gives worse results IMO...

def splitAndFind(text, phrase):
    # Returns None if phrase is absent; only the first occurrence is considered.
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])
print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])


timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
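
For anyone who wants to repeat these measurements outside an interactive shell, here is a minimal sketch using the standard timeit module. It uses a short sample text instead of the full zen string, and split_and_find is a simplified stand-in for splitAndFind that splits on the phrase directly; both are assumptions made for illustration. Absolute numbers will vary by machine and Python version, so none are shown.

```python
import re
import timeit

# Short sample text (an assumption; the timings above used the full zen string).
text = ("Although that way may not be obvious at first unless you're Dutch. "
        "Now is better than never.")
pattern = re.compile(r"((?:\S+\s*){,5})(Dutch)((?:\S+\s*){,5})", re.IGNORECASE)


def split_and_find(text, phrase):
    """Five words before and after the first occurrence of phrase, or None."""
    parts = text.split(phrase)
    if len(parts) > 1:
        return parts[0].split()[-5:], parts[1].split()[:5]


# Total seconds for 1000 calls of each approach.
timings = {
    "regex findall": timeit.timeit(lambda: pattern.findall(text), number=1000),
    "in operator": timeit.timeit(lambda: "Dutch" in text, number=1000),
    "str.find": timeit.timeit(lambda: text.find("Dutch"), number=1000),
    "split_and_find": timeit.timeit(lambda: split_and_find(text, "Dutch"), number=1000),
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-15s %.6f s per 1000 calls" % (name, seconds))
```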


