python - Exact hitting set for keywords in text

Published: 2020-12-20 13:30:02 | Category: Python | Source: compiled from the web


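There are some keywords in a text, together with the start/end positions of their occurrences. Keywords may partially overlap, e.g. "something" -> "something" / "some" / "thing":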

keywords_occurences = {
    "key_1": [(11, 59)],
    "key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
    "key_3": [(24, 56), (1208, 1240)],
    ...
}

I need to select one position for each keyword so that the selected positions do not overlap. For this case the solution is:

key_1: 11-59
key_2: 301-323     (or 1673-1695, it does not matter)
key_3: 1208-1240

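If this cannot be done for every keyword, select the largest possible number of distinct keys whose positions do not overlap.

It looks like a kind of "exact hitting set" problem, but I cannot find a description of the algorithm.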

Solution

I think the following code does what you need.

#!/usr/bin/env python3

# keyword occurrences -> [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
# (only the first and last occurrences of key_3 are legible in the source)
kw_all_occ = {"key_1": [(11, 59)],
              "key_2": [(24, 56), (301, 333), (1208, 1240), (1673, 1705)],
              "key_3": [(24, 46), (1208, 1230)]}

def non_overlapping_occ(occ):
    # Work on a copy so the caller's dictionary is left untouched
    all_occ = dict(occ)

    # list with the first non-overlapping occurrence of every keyword ->
    # [('key_1', (start_1, end_1)), ('key_2', (start_2, end_2)), ...]
    result = []

    # Sort keywords by length -> [(22, 'key_3'), (32, 'key_2'), (48, 'key_1')]
    kw_lengths = []
    for k, v in all_occ.items():
        kw_lengths.append((v[0][1] - v[0][0], k))
    kw_lengths.sort()

    while kw_lengths:
        # Current longest keyword
        longest_keyword = kw_lengths.pop()[1]
        try:
            result.append((longest_keyword, all_occ[longest_keyword][0]))
            # Remove occurrences overlapping any occurrence of the current
            # longest_keyword value
            for start, end in all_occ[longest_keyword]:
                for _, k in kw_lengths:
                    # Build a real list (not a lazy filter object) so that it
                    # can be indexed on a later iteration
                    all_occ[k] = [x for x in all_occ[k]
                                  if x[0] > end or x[1] < start]
        except IndexError:
            # Every occurrence of this keyword has been filtered out
            result.append((longest_keyword, ()))

    return result

print(non_overlapping_occ(kw_all_occ))

It produces the following output:

vicent@deckard:~$ python prova.py
[('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
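
The filtering step in the code above keeps an occurrence only when it starts after `end` or finishes before `start`, so positions are treated as closed intervals and two occurrences that merely share an endpoint count as overlapping. The following is a minimal sketch of that test in isolation. The helper names `overlaps` and `drop_overlapping` are made up for this illustration and are not part of the code above; the data comes from the question.

```python
def overlaps(a, b):
    """Return True if the closed intervals a and b share at least one position."""
    return not (a[0] > b[1] or a[1] < b[0])


def drop_overlapping(candidates, taken):
    """Keep only the candidate occurrences that do not overlap `taken`."""
    return [c for c in candidates if not overlaps(c, taken)]


# Occurrences of key_2 from the question, filtered against key_1's (11, 59)
key_2 = [(24, 46), (301, 323), (1208, 1230), (1673, 1695)]
print(drop_overlapping(key_2, (11, 59)))
# -> [(301, 323), (1208, 1230), (1673, 1695)]
```

If adjacent occurrences such as (11, 59) and (59, 70) should be allowed to coexist, the comparisons would need to become `>=` and `<=`; whether that is wanted depends on how the end positions are defined.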

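Note that I have not used sets in the code. Your question only suggests that sets might help solve the problem, so I understand that using them is not mandatory.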

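Note also that the code has not been thoroughly tested, but it seems to work fine. It also correctly handles the keyword occurrences given in your question; in fact, those occurrences could be handled by simpler but less general code.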
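
Because the longest-first strategy is greedy, it is not guaranteed to place the largest possible number of keys. On small inputs, an exhaustive search can serve as a cross-check. The following is a minimal sketch of such a search, not part of the answer's code: the function name `best_assignment` is hypothetical, the running time is exponential in the worst case, and it assumes the same closed-interval overlap rule as above.

```python
def best_assignment(occurrences):
    """Pick at most one occurrence per keyword, with no two picks overlapping,
    so that as many keywords as possible receive a position."""
    keys = sorted(occurrences)
    best = {}

    def overlaps(a, b):
        # Closed intervals: sharing an endpoint counts as overlapping
        return not (a[0] > b[1] or a[1] < b[0])

    def search(i, chosen):
        nonlocal best
        # Prune: even placing every remaining key cannot beat the best so far
        if len(chosen) + (len(keys) - i) <= len(best):
            return
        if i == len(keys):
            best = dict(chosen)
            return
        key = keys[i]
        for occ in occurrences[key]:
            if all(not overlaps(occ, other) for other in chosen.values()):
                chosen[key] = occ
                search(i + 1, chosen)
                del chosen[key]
        # Also try leaving this keyword without a position
        search(i + 1, chosen)

    search(0, {})
    return best


# Data from the question
question_occ = {
    "key_1": [(11, 59)],
    "key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
    "key_3": [(24, 56), (1208, 1240)],
}
print(best_assignment(question_occ))
# -> {'key_1': (11, 59), 'key_2': (301, 323), 'key_3': (1208, 1240)}
```

For the question's data this reproduces the expected selection. Keywords that cannot be placed are simply absent from the returned dictionary.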

(Editor: 李大同)
