正则表达式 – 子序列搜索

发布时间：2020-12-14 02:29:12 所属栏目：百科来源：网络整理

导读：我有大量的列表(总共35 MB),我想搜索子序列：每个术语必须按顺序出现,但不一定是连续出现.所以1,2,3匹配每个 1,3,4,5,61,3 但不是 6,1123,6,7 (,是分隔符,而不是要匹配的字符.) 如果没有在数十或数十万个序列上运行正则表达式(/ 1,([^,],)* 2,* * /),我怎样

我有大量的列表(总共35 MB),我想搜索子序列：每个术语必须按顺序出现,但不一定是连续出现.所以1,2,3匹配每个

1,3,4,5,6
1,3

但不是

6,1
123,6,7

(,是分隔符,而不是要匹配的字符.)

如果没有在数十或数十万个序列上运行正则表达式(/ 1,([^,],)* 2,* * /),我怎样才能确定哪个序列是比赛？我可以预处理序列,但内存使用需要保持合理(在现有序列大小的常数因子内,比方说).最长的序列很短,小于一千字节,因此您可以假设查询也很短.

如果单个数字分布在文件上而没有出现在大多数行上,则对它们出现的行号进行简单索引可以加快速度.但是,如果您的数据是以不同顺序重复的相同数字的行,则会更慢.

要构建索引,只需要沿着这些行单次传递数据：

Hash<int,List<int>> index

line_number = 1
foreach(line in filereader)
{
    line_number += 1
    foreach(parsed_number in line)
        index[parsed_number].append(line)
}

该索引可以存储并重新用于数据集.要搜索它只需要这样的东西.请原谅混合的伪代码,我试图尽可能清楚.当它超出可能的匹配时它“返回”,并且当子串的所有元素都出现在该行上时,“产生”一个行号.

// prefilled hash linking number searched for to a list of line numbers
// the lines should be in ascending order
Hash<int,List<int>> index

// The subsequence we're looking for
List<int> subsequence = {1,3}
int len = subsequence.length()

// Take all the lists from the index that match the numbers we're looking for
List<List<int>> lines = index[number] for number in subsequence

// holder for our current search row
// has the current lowest line number each element occurs on 
int[] search = new int[len]
for(i = 0; i < len; i++)
    search[i] = lines[i].pop()

while(true)
{
    // minimum line number,substring position and whether they're equal
    min,pos,eq = search[0],true

    // find the lowest line number and whether they all match
    for(i = 0; i < len; i++)
    {
        if(search[i] < min)
            min,p,eq = search[i],i,false
        else if (search[i] > min)
            eq = false
    }

    // if they do all match every one of the numbers occurs on that row
    if(eq)
    {
        yield min; // line has all the elements

        foreach(list in lines)
            if(list.empty())  // one of the numbers isn't in any more lines
                 return

        // update the search to the next lowest line number for every substring element
        for(i = 0; i < len; i++)
            search[i] = lines[i].pop()
    }
    else
    {
        // the lowest line number for each element is not the same,so discard the lowest one
        if(lines[position].empty()) // there are no more lines for the element we'd be updating
            return

        search[position] = lines[position].pop();
    }
}

笔记：

>这可以简单地扩展到存储行中的位置以及行号,然后在“yield”点处只有一点额外的逻辑将能够确定实际匹配而不是仅存在所有项目.
>我使用“pop”来显示它是如何在行号中移动的,但实际上你并不想在每次搜索时都破坏你的索引.
>我假设这些数字都符合这里的整数.如果您的数字非常庞大,则将其扩展为long或甚至将每个数字的字符串表示映射到int.
>有一些加速,特别是在“流行”阶段跳过线,但我去了更明确的解释.

无论是使用此方法还是其他方法,您还可以根据数据进行计算.单个通道可以确定每条线是上升,下降,所有奇数,所有偶数,还是最高和最低数字都可以用来减少每个子字符串的搜索空间.这些是否有用完全取决于您的数据集.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!