php – 具有错误字符容差的最长Common Substring
我在这里找到一个脚本,在寻找最低公共子串时效果很好.
但是,我需要它来容忍一些不正确/缺少的字符.我希望能够输入所需的相似性百分比,或者可以指定允许的丢失/错误字符的数量. 例如,我想找到这个字符串: 大黄校车 在这个字符串里面: 那天下午他们骑着bigyellow schook巴士 这是我目前正在使用的代码: function longest_common_substring($words) { $words = array_map('strtolower',array_map('trim',$words)); $sort_by_strlen = create_function('$a,$b','if (strlen($a) == strlen($b)) { return strcmp($a,$b); } return (strlen($a) < strlen($b)) ? -1 : 1;'); usort($words,$sort_by_strlen); // We have to assume that each string has something in common with the first // string (post sort),we just need to figure out what the longest common // string is. If any string DOES NOT have something in common with the first // string,return false. $longest_common_substring = array(); $shortest_string = str_split(array_shift($words)); while (sizeof($shortest_string)) { array_unshift($longest_common_substring,''); foreach ($shortest_string as $ci => $char) { foreach ($words as $wi => $word) { if (!strstr($word,$longest_common_substring[0] . $char)) { // No match break 2; } } // we found the current char in each word,so add it to the first longest_common_substring element,// then start checking again using the next char as well $longest_common_substring[0].= $char; } // We've finished looping through the entire shortest_string. // Remove the first char and start all over. Do this until there are no more // chars to search on. array_shift($shortest_string); } // If we made it here then we've run through everything usort($longest_common_substring,$sort_by_strlen); return array_pop($longest_common_substring); } 任何帮助深表感谢. UPDATE PHP levenshtein函数限制为255个字符,我正在搜索的一些干草堆是1000个字符. 解决方法
写这个作为第二个答案,因为它不是基于我以前的(坏)一个.
此代码基于http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm和http://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms 给出$needle,它返回$haystack的一个(可能是几个)最小levenshtein子串.现在,levenshtein距离只是编辑距离的一个衡量标准,它可能实际上并不适合您的需求. ‘hte’在这个指标上更接近’他’而不是”’.我提出的一些例子表明了这种技术的局限性.我相信这比我之前给出的答案要可靠得多,但让我知道它对你有什么用. // utility function - returns the key of the array minimum function array_min_key($arr) { $min_key = null; $min = PHP_INT_MAX; foreach($arr as $k => $v) { if ($v < $min) { $min = $v; $min_key = $k; } } return $min_key; } // Calculate the edit distance between two strings function edit_distance($string1,$string2) { $m = strlen($string1); $n = strlen($string2); $d = array(); // the distance from '' to substr(string,$i) for($i=0;$i<=$m;$i++) $d[$i][0] = $i; for($i=0;$i<=$n;$i++) $d[0][$i] = $i; // fill-in the edit distance matrix for($j=1; $j<=$n; $j++) { for($i=1; $i<=$m; $i++) { // Using,for example,the levenshtein distance as edit distance list($p_i,$p_j,$cost) = levenshtein_weighting($i,$j,$d,$string1,$string2); $d[$i][$j] = $d[$p_i][$p_j]+$cost; } } return $d[$m][$n]; } // Helper function for edit_distance() function levenshtein_weighting($i,$string2) { // if the two letters are equal,cost is 0 if($string1[$i-1] === $string2[$j-1]) { return array($i-1,$j-1,0); } // cost we assign each operation $cost['delete'] = 1; $cost['insert'] = 1; $cost['substitute'] = 1; // cost of operation + cost to get to the substring we perform it on $total_cost['delete'] = $d[$i-1][$j] + $cost['delete']; $total_cost['insert'] = $d[$i][$j-1] + $cost['insert']; $total_cost['substitute'] = $d[$i-1][$j-1] + $cost['substitute']; // return the parent array keys of $d and the operation's cost $min_key = array_min_key($total_cost); if ($min_key == 'delete') { return array($i-1,$cost['delete']); } elseif($min_key == 'insert') { return array($i,$cost['insert']); } else { return array($i-1,$cost['substitute']); } } // attempt to find the substring of $haystack most closely matching $needle function shortest_edit_substring($needle,$haystack) { // initialize edit distance matrix $m = strlen($needle); $n = strlen($haystack); $d = array(); for($i=0;$i<=$m;$i++) { $d[$i][0] = $i; $backtrace[$i][0] = null; } // instead of strlen,we initialize the top row to all 0's for($i=0;$i<=$n;$i++) { $d[0][$i] = 0; $backtrace[0][$i] = null; } // same as the edit_distance calculation,but keep track of how we got there for($j=1; $j<=$n; $j++) { for($i=1; $i<=$m; $i++) { list($p_i,$needle,$haystack); $d[$i][$j] = $d[$p_i][$p_j]+$cost; $backtrace[$i][$j] = array($p_i,$p_j); } } // now find the minimum at the bottom row $min_key = array_min_key($d[$m]); $current = array($m,$min_key); $parent = $backtrace[$m][$min_key]; // trace up path to the top row while(! is_null($parent)) { $current = $parent; $parent = $backtrace[$current[0]][$current[1]]; } // and take a substring based on those results $start = $current[1]; $end = $min_key; return substr($haystack,$start,$end-$start); } // some testing $data = array( array('foo',' foo'),array('fat','far'),array('dat burn','rugburn')); $data[] = array('big yellow school bus','they rode the bigyellow schook bus that afternoon'); $data[] = array('bus','they rode the bigyellow schook bus that afternoon'); $data[] = array('big','they rode the bigyellow schook bus that afternoon'); $data[] = array('nook','they rode the bigyellow schook bus that afternoon'); $data[] = array('they','console,controller and games are all in very good condition,only played occasionally. includes power cable,controller charge cable and audio cable. smoke free house. pes 2011 super street fighter'); $data[] = array('controker',controller charge cable and audio cable. smoke free house. pes 2011 super street fighter'); foreach($data as $dat) { $substring = shortest_edit_substring($dat[0],$dat[1]); $dist = edit_distance($dat[0],$substring); printf("Found |%s| in |%s|,matching |%s| with edit distance %dn",$substring,$dat[1],$dat[0],$dist); } (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |