python – Text segmentation: an algorithm to match the input against the longest words in a dictionary
Published: 2020-12-20 11:27:28 | Category: Python | Source: web
I need to split a string into words such that each word comes from a dictionary, while making sure the longest possible word on the left is always preferred. So:

thisisinsane => this is insane (correct, as the longest possible word from the left)
thisisinsane => this is in sane (wrong)

assuming 'this', 'is', 'in' and 'insane' are all words in the dictionary.

I managed to solve this by traversing from the end of the string toward the beginning to match the longest word. But problems start to appear with cases like these:

shareasale => share as ale (wrong, as 'ale' is not in the dictionary)
shareasale => share a sale (correct)

assuming 'share', 'a' and 'sale' are all words in the dictionary, unlike 'ale'.

I tried to fix this by taking the valid segments found before the error is hit, i.e.

shareasale => 'share' and 'as' (error = 'ale')

removing them from the dictionary one at a time, and then re-solving. So:

shareasale => no valid segments when 'share' is removed
shareasale => share a sale (when 'as' is removed from the dictionary)

That way I managed to work around the problem. But then I could not solve this case:

asignas => 'as' (error = 'ignas')

My solution then removes 'as' from the dictionary and tries again:

asignas => 'a' 'sign' (error = 'as')

because 'as' was deleted from the dictionary in the new recursive call. The function I wrote is in this link. I hope someone can go through it and either help me find a better algorithm or suggest modifications to my existing one.

Solution
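To see concretely why the naive left-to-right greedy strategy described in the question fails, here is a minimal sketch (`greedy_segment` is an illustrative name, not the asker's actual function): it always takes the longest dictionary word at the current position and never backtracks, so it dead-ends on 'shareasale'.

```python
def greedy_segment(s, words):
    """Naive greedy: repeatedly strip the longest matching prefix.
    Returns the word list, or None when no dictionary word matches
    the remaining string (no backtracking is attempted)."""
    result = []
    while s:
        # Longest dictionary word that is a prefix of the remaining string.
        match = max((w for w in words if s.startswith(w)), key=len, default=None)
        if match is None:
            return None  # dead end: greedy cannot recover
        result.append(match)
        s = s[len(match):]
    return result

words = ['share', 'a', 'as', 'sale']
print(greedy_segment('shareasale', words))  # None: takes 'share', then 'as', then is stuck on 'ale'
print(greedy_segment('thisisinsane', ['this', 'is', 'in', 'insane']))  # ['this', 'is', 'insane']
```

The greedy choice is only wrong when a longer word steals characters that a later word needs, which is exactly what a backtracking search fixes.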
Basically, your problem is a tree problem: at each level, all the words that form a prefix of the remaining string form the branches. A branch that leaves no part of the string unconsumed is a valid solution.
```
                      thisisinsane
                            |
                     (this)isinsane
                     /             \
       (this,i)sinsane        (this,is)insane
             |                 /           \
    (this,i,sin)ane   (this,is,in)sane   (this,is,insane)
                            |
                    (this,is,in,sane)
```

So in this example there are two complete solutions, but we want to choose the one using the longest words, i.e. we want to explore the tree from the right, using a depth-first search strategy.

Our algorithm should therefore:

> Sort the dictionary in decreasing order of word length.
> Depth-first search the tree, always trying the longest matching prefix first and backtracking when a branch dead-ends.

An example implementation of this solution:

```python
def segment(string, wset):
    """Segment a string into words, preferring longer words, given a dictionary wset."""
    # Sort wset in decreasing length order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop()      # Remove the empty string token
        result.reverse()  # Put the list into the correct order
        return result
    else:
        raise Exception("No possible segmentation!")


def tokenize(string, wset, token):
    """Return False if the string can't be segmented with the current wset,
    or a list of words that segment the string, in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Try all possible prefixes, longest first
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string[len(pref):], wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False


print(segment("thisisinsane", ['this', 'is', 'in', 'insane']))  # ['this', 'is', 'insane']
print(segment("shareasale", ['share', 'a', 'as', 'sale']))      # ['share', 'a', 'sale']
print(segment("asignas", ['as', 'sign', 'a']))                  # ['a', 'sign', 'as']
```

(Editor: Li Datong)
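For large dictionaries or long inputs, the recursive search can revisit the same suffix many times after backtracking. Here is a sketch of the same longest-first depth-first search with memoization on the remaining suffix (`segment_cached` and `solve` are illustrative names, not from the original post):

```python
from functools import lru_cache

def segment_cached(string, wset):
    """Longest-first DFS segmentation, caching results per remaining suffix
    so dead ends are only explored once. Returns a word list, or None if
    the string cannot be segmented."""
    words = sorted(set(wset), key=len, reverse=True)

    @lru_cache(maxsize=None)
    def solve(rest):
        if not rest:
            return ()  # empty suffix: successfully consumed everything
        for w in words:  # longest prefixes tried first
            if rest.startswith(w):
                tail = solve(rest[len(w):])
                if tail is not None:
                    return (w,) + tail
        return None  # this suffix cannot be segmented

    result = solve(string)
    return list(result) if result is not None else None

print(segment_cached('asignas', ['as', 'sign', 'a']))  # ['a', 'sign', 'as']
print(segment_cached('shareasale', ['share', 'a', 'as', 'sale']))  # ['share', 'a', 'sale']
```

Returning None instead of raising keeps the cached helper simple; a caller can convert that to an exception, as segment above does.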