bash - Is there an inverse of grep: finding short lines in longer patterns?

Published: 2020-12-15 22:26:56 | Category: Security | Source: compiled from the web

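Where grep finds a short pattern from a pattern file inside the long lines of a lookup file, I need the opposite: a tool that extracts the short lines of a lookup file that can be found within a longer pattern.

In other words, given the works of Shakespeare with one sentence per line and, say, a French dictionary, I want to find which French words occur in which lines of Shakespeare. A single line of Shakespeare may contain more than one French word, and a French word may appear in more than one line of Shakespeare.

For example: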

pattern_file={
"The sun is shining!"
"It is a beautiful day!"}

lookup_file={
"Rain"
"Sun"
"Cloud"
"Beautiful"
"Shining"}

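What I would like is

function file pattern

that returns both the line found in the longer pattern and the longer pattern itself, separated by a comma, and that detects multiple matches.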

ideal_result_file={
"Sun","The sun is shining!"
"Beautiful","It is a beautiful day!","Shining","The sun is shining!"}

At the moment, I loop over the whole lookup file line by line with grep:

while IFS= read -r line
    do
      # Quote "$line" so entries containing spaces or glob characters stay one argument
      grep -is -- "$line" pattern_file | sed 's/^/'"$line"',/' >> result_file.csv
    done < lookup_file

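This is incredibly slow! My lookup_file contains over 50 000 lines, while my pattern_file contains 500. Using grep to find an even shorter pattern in my lookup_file takes a few seconds, but a single pass with my loop approach takes days or weeks.

Solutions in any language would be appreciated.

Somewhat related to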
Very slow loop using grep or fgrep on large datasets
Is Perl faster than bash?

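The solution needs to be compatible with GB-size lookup and pattern files.

Solution

Use a hash table or set (depending on your language) to store the dictionary in all lower case. For each line, split the line into an array of words on non-alphabetic characters. Build a miniature hash table from those words, converted to lower case, to eliminate duplicates. Iterate over each word in that miniature hash table and check whether it exists in your dictionary hash table. If it does, print the word and the entire line.

Here is an implementation of this in Perl.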

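#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the à-ÿ range below is written in UTF-8
use open qw(:std :encoding(UTF-8));   # assumption: both input files are UTF-8

my ($dictFile, $srchFile) = @ARGV;
(defined $srchFile and -f $dictFile and -f $srchFile)
  or die "Usage: $0 dictFile srchFile\n";

# Load dictionary into hash table (keys are lower-cased)
my %dict;
open(my $df, '<', $dictFile) or die "Cannot open $dictFile: $!";
while (<$df>) {
  chomp;
  $dict{lc($_)} = 1;
}
close($df);

# Search file for your dictionary words
open(my $sf, '<', $srchFile) or die "Cannot open $srchFile: $!";
my $lineNo = 0;
while (my $line = <$sf>) {
  $lineNo++;
  chomp($line);
  # Collect the unique lower-cased words of this line
  my %words;
  foreach my $word (split(/[^a-zA-Zà-ÿ0-9]+/, $line)) {
    $words{lc($word)} = 1;
  }
  # Sorted so the output order is reproducible
  foreach my $key (sort keys %words) {
    if ($dict{$key}) {
      print "$lineNo,$key,$line\n";
    }
  }
}
close($sf);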

pattern.txt

The sun is shining!
It is a beautiful day!

lookup.txt

Rain
Sun
Cloud
Beautiful
Shining

$ ./deepfind lookup.txt pattern.txt

1,shining,The sun is shining!
1,sun,The sun is shining!
2,beautiful,It is a beautiful day!
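
Since the question welcomes any language, the same hash-set idea can be sketched in Python. This is only an illustration of the approach described above, not a port verified against the Perl script: the function names `load_dictionary` and `find_words` are made up for this example, and entries are trimmed with `strip()` where the Perl uses `chomp`. Both functions accept any iterable of lines, so open file objects can be passed in and the searched file is streamed one line at a time.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Same word characters as the Perl split: ASCII letters, digits, and à-ÿ.
WORD_SPLIT = re.compile(r"[^a-zA-Zà-ÿ0-9]+")


def load_dictionary(lines: Iterable[str]) -> Set[str]:
    """Return the non-empty dictionary entries, lower-cased, as a set."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_words(
    dictionary: Set[str], sentences: Iterable[str]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, word, sentence) for each dictionary word in a sentence."""
    for line_no, raw in enumerate(sentences, start=1):
        sentence = raw.rstrip("\n")
        # Unique lower-cased words of this sentence; empty pieces are dropped.
        words = {w.lower() for w in WORD_SPLIT.split(sentence) if w}
        # Set intersection replaces the per-word hash lookups.
        for word in sorted(words & dictionary):
            yield line_no, word, sentence


if __name__ == "__main__":
    lookup = ["Rain", "Sun", "Cloud", "Beautiful", "Shining"]
    pattern = ["The sun is shining!", "It is a beautiful day!"]
    for line_no, word, sentence in find_words(load_dictionary(lookup), pattern):
        print(f"{line_no},{word},{sentence}")
```

Only the dictionary has to fit in memory. Each sentence costs time proportional to its own length, regardless of how many dictionary entries there are, which is what removes the need for one grep run per lookup line.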

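EDIT: Based on your comments, here is an alternative way to define the set of "words" in a "sentence". It prepares every viable sequence whose length matches the length of some sequence found in the dictionary.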

#!/usr/bin/perl
use strict;
use warnings;

my ($dictFile, $srchFile) = @ARGV;
(defined $srchFile and -f $dictFile and -f $srchFile)
  or die "Usage: $0 dictFile srchFile\n";

# Load sequence dictionary into hash table, remembering every entry length
my %dict;
my %sizes;
open(my $df, '<', $dictFile) or die "Cannot open $dictFile: $!";
while (<$df>) {
  chomp;
  $dict{lc($_)} = 1;
  $sizes{length($_)} = 1;
}
close($df);

# Search file for known sequences
open(my $sf, '<', $srchFile) or die "Cannot open $srchFile: $!";
my $lineNo = 0;
while (my $line = <$sf>) {
  $lineNo++;
  chomp($line);
  # Populate a hash table with every unique sequence that could be matched
  my %sequences;
  foreach my $size (keys %sizes) {
    for (my $i = 0; $i <= length($line) - $size; $i++) {
      # Lower-cased so it compares equal to the lower-cased dictionary keys
      $sequences{lc(substr($line, $i, $size))} = 1;
    }
  }
  # Compare each sequence with the dictionary of sequences.
  foreach my $sequence (sort keys %sequences) {
    if ($dict{$sequence}) {
      print "$lineNo,$sequence,$line\n";
    }
  }
}
close($sf);
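
The substring variant above can also be sketched in Python. The names `load_sequences` and `find_sequences` are hypothetical, chosen for this example, and the sketch assumes case-insensitive matching is wanted, as with `grep -i` in the question. It collects the distinct entry lengths once, then for each sentence builds only the substrings of those lengths and intersects them with the dictionary.

```python
from typing import Iterable, Iterator, Set, Tuple


def load_sequences(lines: Iterable[str]) -> Tuple[Set[str], Set[int]]:
    """Return the lower-cased entries and the set of their distinct lengths."""
    entries = {line.rstrip("\n").lower() for line in lines}
    entries.discard("")  # an empty entry would match every sentence
    return entries, {len(entry) for entry in entries}


def find_sequences(
    entries: Set[str], sizes: Set[int], sentences: Iterable[str]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, sequence, sentence) for each entry found as a substring."""
    for line_no, raw in enumerate(sentences, start=1):
        sentence = raw.rstrip("\n")
        lowered = sentence.lower()
        # Every substring whose length equals the length of some dictionary entry.
        candidates = {
            lowered[i:i + size]
            for size in sizes
            for i in range(len(lowered) - size + 1)
        }
        for sequence in sorted(candidates & entries):
            yield line_no, sequence, sentence


if __name__ == "__main__":
    entries, sizes = load_sequences(["Sun", "is a", "shin"])
    pattern = ["The sun is shining!", "It is a beautiful day!"]
    for line_no, sequence, sentence in find_sequences(entries, sizes, pattern):
        print(f"{line_no},{sequence},{sentence}")
```

Unlike the word-based version, this also matches entries that contain spaces or that occur inside longer words. The price is that each sentence generates roughly (sentence length) x (number of distinct entry lengths) candidate substrings, so it gets slower when dictionary entries come in many different lengths.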

(Editor: 李大同)
