Perl: finding valid pairs of lines under different conditions

Published: 2020-12-16 06:16:56 | Category: Big Data | Source: compiled from the web

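I have HTTP header request and reply data in tab-separated form, with each GET/POST and each reply on its own line. The data is such that a single TCP flow can contain multiple GETs, POSTs, and REPLYs. In those cases I need to select only the first valid GET-REPLY pair. A (simplified) example is: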

ID       Source    Dest    Bytes   Type   Content-Length  host               lines.... 
1         A         B       10     GET        NA          yahoo.com            2
1         A         B       10     REPLY      10          NA                   2 
2         C         D       40     GET        NA          google.com           4
2         C         D       40     REPLY      20          NA                   4
2         C         D       40     GET        NA          google.com           4
2         C         D       40     REPLY      30          NA                   4
3         A         B       250    POST       NA          mail.yahoo.com       5
3         A         B       250    REPLY      NA          NA                   5
3         A         B       250    REPLY      15          NA                   5
3         A         B       250    GET        NA          yimg.com             5
3         A         B       250    REPLY      35          NA                   5
4         G         H       415    REPLY      10          NA                   6
4         G         H       415    POST       NA          facebook.com         6
4         G         H       415    REPLY      NA          NA                   6
4         G         H       415    REPLY      NA          NA                   6
4         G         H       415    GET        NA          photos.facebook.com  6
4         G         H       415    REPLY      50          NA                   6

....

So, basically, I need to get one request-reply pair for each ID and write the pairs to a new file.

For '1' there is just one pair, so it is easy. But there are also erroneous cases where both lines are GET, POST, or REPLY. Those cases are ignored.

For '2', I would select the first GET-REPLY pair.

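For '3', I would select the first request (the POST) but the second REPLY, because the Content-Length is absent from the first REPLY (which makes the subsequent REPLY a better candidate).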

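For '4', I would select the first POST (or GET), because the first header cannot be a REPLY. Even though the Content-Length is missing from the REPLYs after the POST, I would not select the REPLY after the second GET, because that REPLY belongs to the later request. So I would just select the first REPLY.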
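The rules for IDs 1 to 4 can be sketched as a small function that looks at the rows of one ID at a time. This is only an illustration under assumptions: the helper name `pick_pair` and the one-hash-per-row layout are hypothetical, and an "erroneous" group is read here as one with no request, or with no REPLY after its first request.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the rows of ONE ID (hash refs, in file order)
# and returns the merged request/reply record, or nothing if no pair exists.
sub pick_pair {
    my @rows = @_;
    my ($request, $length);

    for my $row (@rows) {
        if ($row->{Type} eq 'GET' or $row->{Type} eq 'POST') {
            last if $request;    # a second request ends the search
            $request = $row;     # keep the first request
        }
        elsif ($request) {
            # A REPLY after the first request: take its Content-Length
            # if we have none yet, or only have 'NA' so far
            $length = $row->{'Content-Length'}
                if !defined $length or $length eq 'NA';
        }
        # A REPLY seen before any request is skipped
    }

    return unless $request and defined $length;
    return { %$request, 'Content-Length' => $length };
}

# Rows for ID 3 from the example data (only the relevant columns)
my @id3 = (
    { ID => 3, Type => 'POST',  'Content-Length' => 'NA', host => 'mail.yahoo.com' },
    { ID => 3, Type => 'REPLY', 'Content-Length' => 'NA', host => 'NA' },
    { ID => 3, Type => 'REPLY', 'Content-Length' => 15,   host => 'NA' },
    { ID => 3, Type => 'GET',   'Content-Length' => 'NA', host => 'yimg.com' },
    { ID => 3, Type => 'REPLY', 'Content-Length' => 35,   host => 'NA' },
);

my $pair = pick_pair(@id3);
print join("\t", @{$pair}{ qw/ ID Type Content-Length host / }), "\n";
```

For the ID 3 rows this keeps the POST to mail.yahoo.com and takes Content-Length 15 from the second REPLY, and it stops at the second request so the REPLY with 35 is never considered.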

So, after selecting the best request and reply pair, I need to combine them on a single line. For example, the output would be:

ID       Source    Dest    Bytes   Type   Content-Length  host         .... 
   1         A         B       10     GET      10          yahoo.com
   2         C         D       40     GET      20          google.com
   3         A         B       250    POST     15          mail.yahoo.com
   4         G         H       415    POST     NA          facebook.com

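There are many other headers in the actual data, but this example pretty much shows what I need. How can this be done in Perl? I am pretty much stuck at the start, so all I have managed is to read the file one line at a time.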

my $file = "file.txt";
open my $fh, "<", $file or die "Cannot open $file: $!";

  while (<$fh>) {
    chomp;
    my @line = split /\t/;


      # get the valid pairs for cases with multiple request - replies


      # get the paired up data together

  }
  close $fh;

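*Edit: I added an extra column giving the number of HTTP header lines for each ID. This may help to know how many subsequent lines to check. Also, I modified ID '4' so that its first header line is a REPLY.*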
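One way the extra column could be used is to read each ID's rows as a block: read one row, then read as many further rows as the count says. The sketch below assumes the count is the last field on each row and is accurate. It reads from an in-memory sample and splits on whitespace only to stay self-contained; the names `%groups` and `@order` are made up for the illustration.

```perl
use strict;
use warnings;

# Sample rows in the question's layout; the last column is the per-ID line count
my $sample = <<'END';
1 A B 10 GET NA yahoo.com 2
1 A B 10 REPLY 10 NA 2
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 20 NA 4
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 30 NA 4
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";

my %groups;    # ID => array ref of rows, each row an array ref of fields
my @order;     # IDs in the order they appear

while (my $line = <$fh>) {
    my @first = split ' ', $line;
    next unless @first;                  # skip blank lines
    my ($id, $count) = @first[0, -1];    # assumption: the count is the last field

    my @rows = (\@first);
    # Read the remaining $count - 1 rows that belong to this ID
    for (2 .. $count) {
        my $next = <$fh>;
        last unless defined $next;       # file ended early
        push @rows, [ split ' ', $next ];
    }

    $groups{$id} = \@rows;
    push @order, $id;
}
close $fh;

printf "ID %s: %d rows\n", $_, scalar @{ $groups{$_} } for @order;
```

If the count column cannot be trusted, grouping on a change of ID is the safer choice, since it does not depend on the count being right.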

Solution

The program below does what I think you need.

It is commented, and I think it is fairly clear. Please ask if anything is unclear.

use strict;
use warnings;

use List::Util 'max';

my $file = $ARGV[0] // 'file.txt';
open my $fh,'<',$file or die qq(Unable to open "$file" for reading: $!);

# Read the field names from the first line to index the hashes
# Remember where the data in the file starts so we can get back here
#
my @fields = split ' ',<$fh>;
my $start = tell $fh;

# Build a format to print the accumulated data
# Create a hash that relates column headers to their widths
#
my @headers = qw/ ID Source Dest Bytes Type Content-Length host /;
my %len = map { $_ => length } @headers;

# Read through the file to find the maximum data width for each column
#
while (<$fh>) {
  my %data;
  @data{@fields} = split;
  next unless $data{ID} =~ /^\d/;
  $len{$_} = max($len{$_},length $data{$_}) for @headers;
}

# Build a format string using the values calculated
#
my $format = join '   ',map sprintf('%%%ds',$_),@len{@headers};
$format .= "\n";

# Go back to the start of the data
# Print the column headers
#
seek $fh,$start,0;
printf $format,@headers;

# Build transaction data hashes into $record and print them
# Ignore any events before the first request
# Ignore the second request and anything after it
# Update the stored Content-Length field if a value other than NA appears
#
my $record;
my $nreq = 0;

while (<$fh>) {

  my %data;
  @data{@fields} = split;
  my ($id,$type) = @data{ qw/ ID Type / };
  next unless $id =~ /^\d/;

  if ($record and $id ne $record->{ID}) {
    printf $format,@{$record}{@headers};
    undef $record;
    $nreq = 0;
  }

  if ($type eq 'GET' or $type eq 'POST') {
    $record = \%data if $nreq == 0;
    $nreq++;
  }
  elsif ($nreq == 1) {
    if ($record->{'Content-Length'} eq 'NA' and $data{'Content-Length'} ne 'NA') {
      $record->{'Content-Length'} = $data{'Content-Length'};
    }
  }
}

printf $format,@{$record}{@headers} if $record;

Output

With the data given in the question, this program produces

ID   Source   Dest   Bytes    Type   Content-Length                  host
 1        A      B      10     GET               10             yahoo.com
 2        C      D      40     GET               20            google.com
 3        A      B     250    POST               15        mail.yahoo.com
 4        G      H     415    POST               NA          facebook.com
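The right-aligned columns above come from a format string that is computed from the data rather than hard-coded. A cut-down sketch of that step follows, using three columns and two made-up rows; `@width` and `@rows` are names chosen for this illustration and do not appear in the program above.

```perl
use strict;
use warnings;
use List::Util 'max';

my @headers = qw/ ID Type host /;
my @rows = (
    [ 1, 'GET',  'yahoo.com' ],
    [ 3, 'POST', 'mail.yahoo.com' ],
);

# Start from the header widths, then widen each column to fit its data
my @width = map { length } @headers;
for my $row (@rows) {
    $width[$_] = max($width[$_], length $row->[$_]) for 0 .. $#headers;
}

# '%%%ds' produces e.g. '%14s': '%%' is a literal '%', '%d' takes the width
my $format = join('   ', map { sprintf '%%%ds', $_ } @width) . "\n";

printf($format, @headers);
printf($format, @$_) for @rows;
```

Here the widths come out as 2, 4, and 14, so the format is `%2s   %4s   %14s` plus a newline. Writing `%-14s` instead would left-align a column.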

(Editor: 李大同)
