如何从给定的字符串列表自动生成正则表达式?
发布时间:2020-12-14 06:36:02 所属栏目:百科 来源:网络整理
导读:您将获得2个Strings – A和B的列表。找到匹配A中所有字符串的最短正则表达式,否则为B.请注意,该正则表达式可以匹配/不匹配不在A中而不是B中的其他字符串。简单,我们可以假设我们的字母大小只有2个字符 – 0和1.也只允许这些运算符: * – 0或以上 ? – 0
您将获得2个Strings – A和B的列表。找到匹配A中所有字符串的最短正则表达式,否则为B.请注意,该正则表达式可以匹配/不匹配不在A中而不是B中的其他字符串。简单,我们可以假设我们的字母大小只有2个字符 – 0和1.也只允许这些运算符:
* – 0或以上 为了简单起见,不允许使用正则表达式运算符。我不知道是否允许或者操作符(|)来简化问题。 A和B之间没有共同点。这里有些例子: A=[00,01,10] B=[11] answer = 1*0+1* A=[00,11] B=[10] answer = 0*1*
一种解决这个问题的方法是使用遗传算法。我碰巧有一个
genetic solver laying around,所以我使用以下算法将其应用于您的问题:
从所需投入中获取不同的标记作为基因 >确保生成的字符串是一个有效的正则表达式 >直到找到一个成功的正则表达式 >从不同的令牌的数量开始,并根据需要递增 这是我在C#中的实现 private static void GenerateRegex(IEnumerable<string> target,IEnumerable<string> dontMatch) { string distinctSymbols = new String(target.SelectMany(x => x).Distinct().ToArray()); string genes = distinctSymbols + "?*()+"; Func<string,uint> calcFitness = str => { if (str.Count(x => x == '(') != str.Count(x => x == ')')) { return Int32.MaxValue; } if ("?*+".Any(x => str[0] == x)) { return Int32.MaxValue; } if ("?*+?*+".ToArray().Permute(2) .Any(permutation => str.IndexOf(new string(permutation.ToArray())) != -1)) { return Int32.MaxValue; } Regex regex; try { regex = new Regex("^" + str + "$"); } catch (Exception) { return Int32.MaxValue; } uint fitness = target.Aggregate<string,uint>(0,(current,t) => current + (regex.IsMatch(t) ? 0U : 1)); uint nonFitness = dontMatch.Aggregate<string,t) => current + (regex.IsMatch(t) ? 10U : 0)); return fitness + nonFitness; }; for (int targetGeneLength = distinctSymbols.Length; targetGeneLength < genes.Length * 2; targetGeneLength++) { string best = new GeneticSolver(50).GetBestGenetically(targetGeneLength,genes,calcFitness,true); if (calcFitness(best) != 0) { Console.WriteLine("-- not solved with regex of length " + targetGeneLength); continue; } Console.WriteLine("solved with: " + best); break; } } 并将其应用于您的样品的结果: public void Given_Sample_A() { var target = new[] { "00","01","10" }; var dontMatch = new[] { "11" }; GenerateRegex(target,dontMatch); } 输出: Generation 1 best: 10 (2) Generation 2 best: 0+ (2) Generation 5 best: 0* (2) Generation 8 best: 00 (2) Generation 9 best: 01 (2) -- not solved with regex of length 2 Generation 1 best: 10* (2) Generation 3 best: 00* (2) Generation 4 best: 01+ (2) Generation 6 best: 10+ (2) Generation 9 best: 00? (2) Generation 11 best: 00+ (2) Generation 14 best: 0?1 (2) Generation 21 best: 0*0 (2) Generation 37 best: 1?0 (2) Generation 43 best: 10? (2) Generation 68 best: 01* (2) Generation 78 best: 1*0 (2) Generation 79 best: 0*1 (2) Generation 84 best: 0?0 (2) Generation 127 best: 01? (2) Generation 142 best: 0+1 (2) Generation 146 best: 0+0 (2) Generation 171 best: 1+0 (2) -- not solved with regex of length 3 Generation 1 best: 1*0+ (1) Generation 2 best: 0+1* (1) Generation 20 best: 1?0+ (1) Generation 31 best: 1?0* (1) -- not solved with regex of length 4 Generation 1 best: 1*00? (1) Generation 2 best: 0*1?0 (1) Generation 3 best: 1?0?0 (1) Generation 4 best: 1?00? (1) Generation 8 best: 1?00* (1) Generation 12 best: 1*0?0 (1) Generation 13 best: 1*00* (1) Generation 41 best: 0*10* (1) Generation 44 best: 1*0*0 (1) -- not solved with regex of length 5 Generation 1 best: 0+(1)? (1) Generation 36 best: 0+()1? (1) Generation 39 best: 0+(1?) (1) Generation 61 best: 1*0+1? (0) solved with: 1*0+1? 第二个样本: public void Given_Sample_B() { var target = new[] { "00","11" }; var dontMatch = new[] { "10" }; GenerateRegex(target,dontMatch); } 输出: Generation 1 best: 00 (2) Generation 2 best: 01 (2) Generation 7 best: 0* (2) Generation 12 best: 0+ (2) Generation 33 best: 1+ (2) Generation 36 best: 1* (2) Generation 53 best: 11 (2) -- not solved with regex of length 2 Generation 1 best: 00* (2) Generation 2 best: 0+0 (2) Generation 7 best: 0+1 (2) Generation 12 best: 00? (2) Generation 15 best: 01* (2) Generation 16 best: 0*0 (2) Generation 19 best: 01+ (2) Generation 30 best: 0?0 (2) Generation 32 best: 0*1 (2) Generation 42 best: 11* (2) Generation 43 best: 1+1 (2) Generation 44 best: 00+ (2) Generation 87 best: 01? (2) Generation 96 best: 0?1 (2) Generation 125 best: 11? (2) Generation 126 best: 1?1 (2) Generation 135 best: 11+ (2) Generation 149 best: 1*1 (2) -- not solved with regex of length 3 Generation 1 best: 0*1* (0) solved with: 0*1* (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |