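Fuzzy text search in Python
I am wondering whether there is any Python library that can do fuzzy text search. For example:
> I have three keywords: "letter", "stamp", and "mail". I tried fuzzywuzzy, but it did not solve my problem. Another library, Whoosh, looks powerful, but I could not find the right function...

Solution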
{1} You can do this in Whoosh 2.7. Fuzzy search is enabled by adding the plugin whoosh.qparser.FuzzyTermPlugin.
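To add the fuzzy plugin:

    from whoosh import qparser

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())

Once the fuzzy plugin has been added to the parser, you can mark a term as fuzzy by appending a tilde (~), optionally followed by a maximum edit distance. If no edit distance is given, the default is 1.

For example, the following are all fuzzy term queries:

    letter~
    letter~2
    letter~2/3

In the last form, the number after the slash is a prefix length: that many leading characters must match exactly.

{2} To keep the words in order, use the query whoosh.query.Phrase, but replace the default Phrase plugin with whoosh.qparser.SequencePlugin, which lets you use fuzzy terms inside a phrase:

    "letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.remove_plugin_class(qparser.PhrasePlugin)
    parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, initialize the slop argument of the phrase query to a larger number:

    whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)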
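To make the combined effect of word order and slop more concrete, here is a minimal pure-Python sketch. It is not how Whoosh implements phrase or sequence queries. The function name `in_order_within_gap` and the `max_gap` parameter are made up for illustration, matching is exact (no fuzziness), and counting "tokens in between" is a simplification that may not line up exactly with how Whoosh counts slop.

```python
import re


def in_order_within_gap(text, words, max_gap):
    """Return True if `words` occur in `text` in the given order, with at
    most `max_gap` other tokens between each consecutive pair.

    Illustration only: exact matching, hypothetical gap semantics.
    """
    tokens = re.findall(r"\w+", text.lower())

    def search(word_idx, start, limit):
        # Every keyword has been placed: the text matches.
        if word_idx == len(words):
            return True
        for pos in range(start, min(limit, len(tokens))):
            if tokens[pos] == words[word_idx]:
                # The next keyword may sit at most max_gap tokens further on.
                if search(word_idx + 1, pos + 1, pos + 2 + max_gap):
                    return True
        return False

    # The first keyword may appear anywhere in the text.
    return search(0, 0, len(tokens))


keywords = ["letter", "stamp", "mail"]
print(in_order_within_gap("letter first, stamp second, mail third", keywords, 1))  # True
print(in_order_within_gap("letter first, stamp second, mail third", keywords, 0))  # False
print(in_order_within_gap("stamp first, letter second, mail third", keywords, 10))  # False
```

With a gap of 0 the keywords would have to be adjacent, which is why a larger slop is needed as soon as other words sit between them. The last call fails regardless of the gap because the keywords are out of order.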
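You can also define the slop in the query string, like this:

    "letter~ stamp~ mail~"~10

{4} Complete solution:

{4.a} The indexer looks like this:

    import os

    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True), content=TEXT)

    # create_in() expects the index directory to exist already.
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(title=u"First document", content=u"This is the first document we've added!")
    writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
    writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
    writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
    # NOTE: the start of the next content string is cut off in the source;
    # it is reconstructed here as a document that lacks "stamp".
    writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
    writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
    # NOTE: the start of the next content string is also cut off in the source;
    # it is reconstructed here as a document with the keywords out of order.
    writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
    writer.commit()

{4.b} The searcher looks like this:

    from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

    with ix.searcher() as searcher:
        parser = QueryParser(u"content", ix.schema)
        parser.add_plugin(FuzzyTermPlugin())
        parser.remove_plugin_class(PhrasePlugin)
        parser.add_plugin(SequencePlugin())
        # The inner double quotes are part of the query, so they must be escaped.
        query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
        results = searcher.search(query)
        print("nb of results = %d" % len(results))
        for r in results:
            print(r)

This gives the following result (as reported under Python 2; Python 3 does not print the u prefixes):

    nb of results = 2
    <Hit {'title': u'Sixth document'}>
    <Hit {'title': u'Third document'}>

{5} If you want fuzzy search to be the default, without writing word~n for every word in the query, you can initialize the QueryParser like this:

    from whoosh.query import FuzzyTerm

    parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10. Keep in mind, though, that FuzzyTerm has a default edit distance of maxdist=1. If you want a larger edit distance, subclass it and pass the subclass as termclass:

    class MyFuzzyTerm(FuzzyTerm):
        def __init__(self, fieldname, text, boost=1.0, maxdist=2,
                     prefixlength=1, constantscore=True):
            # super().__init__(...) also works on Python 3
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist,
                                              prefixlength, constantscore)

    parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)

References:
> whoosh.query.Phrase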
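To see why the edit distance setting matters for the sample documents, here is a small sketch of plain Levenshtein distance in pure Python. It only illustrates the concept of "edit distance". The helper name `levenshtein` is my own, and whether Whoosh counts a swapped pair of letters as one edit or two is an assumption I have not verified.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("letter", "letters"))  # 1
print(levenshtein("stamp", "stamps"))    # 1
print(levenshtein("mail", "mial"))       # 2
```

Under this definition "letters" and "stamps" are within distance 1 of the keywords, while "mial" is at distance 2 from "mail", so a maximum distance of 2 (letter~2, or maxdist=2 in a FuzzyTerm subclass) is the safer choice if misspellings like that should still match.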