Searching for items of a list in another, longer list in Python
Published: 2020-12-20 13:39:25 | Category: Python | Source: collected from the web
I am new to this forum, so apologies if this is a long question.
I am trying to create a generic keyword parser that takes a list of keywords and a list of text lines (which may be generated from a DB or a free-form text file). I want to extract entities from the text lines based on the keyword list so that I can generate three key outputs, including the keywords mentioned.

Below is a sample of the Python code I wrote for this. As you can see, I try to do the job in three phases:

Phase 1 – Apply the reject patterns so that I can remove all known unwanted lines from the list of text lines
Phase 2 (pass 1 of the parse) – Do an index-type search on the keywords to reduce the list of lines that need a full loop search
Phase 3 – Do the full loop search

Problem: Phase 3 (pass 2 in the code) is extremely inefficient. As an example, for a keyword list of 4,500 elements and nearly 2 million text lines, the code runs for more than 24 hours.

Can anyone suggest a better way of doing pass 2? I am a beginner in Python, so apologies in advance if I have missed something obvious.

    import re

    ##########################################################################################
    # The keyWord parser conducts a 2 pass keyword lookup and parsing.
    # Inputs:
    #   keywordIDsList - Is a list of the IDs of the keyword
    #       (Standard declaration: keywordIDsList[] = Hash value of the keyWords)
    #   keywordDict - Is the Dict of all the keywords and the associated ID.
    #       (Standard declaration: keywordDict[keywordID] = (keywordID, keyWord)
    #        where keywordID is hash value in keywordIDsList)
    #   valueIDsList - Is a list of the IDs of all the values that need to be parsed
    #       (Standard declaration: valueIDsList[] = Unique reference number of the values)
    #   valuesDict - Is the Dict of all the value lines and the associated IDs.
    #       (Standard declaration: valuesDict[uniqueValueKey] = (uniqueValueKey, valueText)
    #        where uniqueValueKey is the unique key in valueIDsList)
    #   rejectPattern - A regular expression based pattern for rejecting columns with
    #       certain types of patterns. This is an optional field.
    # Outputs:
    #   parsedHashIDsList - Is the hash value that is generated for every successful parse result
    #   parsedResultsDict - Is the actual parsed value as
    #       parsedResultsDict[parsedHashID] = (uniqueValueKey, keywordID, frequencyResult)
    #   successResultIDsList - list of all unique value references that were parsed successfully
    #   rejectResultIDsList - list of all unique value references that were rejected
    ##########################################################################################
    def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
        parsedResultsDict = {}
        parsedHashIDsList = []
        successResultIDsList = []
        rejectResultIDsList = []
        processListPass1 = []
        processListPass2 = []

        # Index the keywords by their text for exact-match lookups in pass 1.
        idxkeyWordDict = {}
        for keyID in keywordIDsList:
            keywordID, keyWord = keywordDict[keyID]
            idxkeyWordDict[keyWord] = (keywordID, keyWord)

        percCount = 1
        # optional: if rejectPattern is provided then reject lines
        #
        ## Some python code for processing the reject patterns - this works fine
        ## (omitted in the question; it is what fills processListPass1)

        # Pass 1: Index based matching - partial code for index based search
        for valueID in processListPass1:
            valKey, valText = valuesDict[valueID]
            try:
                # Tuple order matches what was stored above: (keywordID, keyWord).
                keywordID, keyWordVal = idxkeyWordDict[valText]
            except KeyError:
                # No exact match: defer the line to the full search in pass 2.
                processListPass2.append(valueID)

        percCount = 0
        # Pass 2: Text based search and lookup - this part of the code is extremely inefficient
        for valueID in processListPass2:
            percCount += 1
            valKey, valText = valuesDict[valueID]
            valSuccess = 'N'
            for keyID in keywordIDsList:
                # keywordDict stores (keywordID, keyWord), so unpack in that order.
                keywordID, keyWordVal = keywordDict[keyID]
                keySearch = re.findall(keyWordVal, valText, re.DOTALL)
                if keySearch:
                    parsedHashID = hash(str(valueID) + str(keyID))
                    # Record the result in the shape documented in the header.
                    parsedHashIDsList.append(parsedHashID)
                    parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                    valSuccess = 'Y'
            if valSuccess == 'Y':
                successResultIDsList.append(valueID)
            else:
                rejectResultIDsList.append(valueID)

        return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)

Solution
This is a perfect use case for the Aho-Corasick string matching algorithm. A similar use case is explained in this blog post, with code examples in Python.
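To make the suggestion concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is not the code from the blog post mentioned above, and the names `build_automaton`, `count_keywords` and `parse_lines` are made up for illustration. It assumes the keywords are literal strings rather than regular expressions, and it counts overlapping occurrences, so its counts can differ from those of `re.findall` in the question. The idea is that every text line is scanned once, no matter how many keywords there are, instead of once per keyword.

```python
from collections import deque, defaultdict


def build_automaton(keywords):
    """Build the trie, failure links and output lists for the keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Step 1: insert every distinct, non-empty keyword into the trie.
    for word in set(keywords):
        if not word:
            continue
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Step 2: breadth-first pass to compute the failure links.
    # States directly below the root keep the root as their fallback.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_keywords(text, automaton):
    """Return {keyword: occurrences} for one text, in a single scan."""
    goto, fail, out = automaton
    counts = defaultdict(int)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return dict(counts)


def parse_lines(keywords, lines):
    """Hypothetical wrapper: lines maps a line ID to its text.

    Returns {line ID: {keyword: count}} for the lines with at least one hit.
    """
    automaton = build_automaton(keywords)
    results = {}
    for line_id, text in lines.items():
        counts = count_keywords(text, automaton)
        if counts:
            results[line_id] = counts
    return results
```

The automaton is built once from the whole keyword list and then reused for every line, which is where the saving over the nested loop in pass 2 comes from. Lines that are missing from the result of `parse_lines` would correspond to the reject list in the original function.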