Quickselect (or similar) implementation on Linux? (instead of sort|uniq -c|sort -rn|head -$N)
Question: I regularly need to find the most frequently repeated "patterns" from the last day of a given log. Like for this small subset of a tomcat log:
    GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
    GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
    GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
    GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
    DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
    GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
    GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
    GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
    ...

If I want to find the most frequently used URLs (along with method and return code), I do:

    [root@srv112:~]$ N=6; cat test|awk '{print $1" "$2" ("$3")"}'\
    |sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
    |sort|uniq -c|sort -rn|head -$N
          4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
          2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
          2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
          2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
          1 POST /app2/servlet/handler (200)
          1 POST /app1/servlet/handler (200)

If I want to find the most frequent usernames from the same file, I do:

    [root@srv112:~]$ N=4; cat test|grep -Po '(?<=/)[0-9]{4,}(?=/)'\
    |sort|uniq -c|sort -rn|head -$N
          9 555412562561928
          2 555411543532089
          1 555417257243373
          1 555416264256223

Both of the above work fine on small data-sets, but for larger input sets the performance (complexity) of sort|uniq -c|sort -rn|head -$N becomes unbearable (we are talking about ~100 servers, ~250 log files per server, ~1 mln lines in each log file).

Attempt to solve: the |sort|uniq -c part can easily be replaced with an awk one-liner, turning the pipeline into:

    |awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N

But I failed to find any standard/simple and memory-efficient implementation of a "quick select algorithm" (discussed here) to optimize the |sort -rn|head -$N part, i.e. something that, given N=3, would turn input like:

    3 tasty oranges
    225 magic balls
    17 happy dolls
    15 misty clouds
    93 juicy melons
    55 rusty ideas
    ...

into:

    225 magic balls
    93 juicy melons
    55 rusty ideas

I could probably grab some sample Java code and port it to the stdin format above (by the way, I am surprised by the lack of .quickselect(...) in core Java), but having to deploy a Java runtime everywhere isn't appealing.

Other obstacles: I also faced a few problems working around this for large data-sets:

> the log files are located on NFS-mounted volumes of ~100 servers, so it made sense to parallelize and spread the work over the servers

My current solution is still not reliable and not optimal (work in progress); it looks like:

    find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|
        # TODO: reorder 'find' output to round-robin through srv1 srv2 ...
        #       to help 'parallel' work with multiple servers at once
    parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
        |awk '{S[\$0]+=1}
          END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|
        # I throw away patterns met less than 5 times per log file
        # in hope those won't pop on top of the result list anyway - bogus,
        # but it helps to address the 16GB-mem problem for the 'awk' below
    awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
      END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|
        # I also skip all patterns met less than 100 times - the hope
        # that these won't be on top of the list is quite reliable
    sort -rn|head -$N
        # the line above is the inefficient one I strive to address
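For illustration, the bounded-memory top-N pass that this last |sort -rn|head -$N stage is meant to become can at least be sketched in plain awk, holding only N lines in memory. This is a rough sketch rather than a true quickselect (it rescans the N survivors after each eviction, so it is O(lines x N), which is fine for small N), and it assumes the "count pattern" input format shown above:

    # single-pass top-N selection over "count pattern" lines;
    # keeps at most N candidate lines in memory at any time
    awk -v N=3 '
    {
        c = $1 + 0                                          # numeric count from field 1
        if (n < N) { cnt[++n] = c; line[n] = $0 }           # still filling the table
        else if (c > cnt[min]) { cnt[min] = c; line[min] = $0 }  # evict current minimum
        else next                                           # too small - skip the rescan
        min = 1                                             # find the new minimum slot
        for (i = 2; i <= n; i++) if (cnt[i] < cnt[min]) min = i
    }
    END {                                                   # print survivors, biggest first
        for (out = 1; out <= n; out++) {
            best = 1
            for (i = 2; i <= n; i++) if (cnt[i] > cnt[best]) best = i
            print line[best]; cnt[best] = -1                # mark slot as printed
        }
    }'

On the six-line example above this prints the same three lines as sort -rn|head -3 would, without ever holding more than N lines.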
Solution

I am not sure whether writing your own little tool is acceptable to you, but you can easily write one that replaces the |sort|uniq -c|sort -rn|head -$N part with |sort|quickselect $N. The benefit of the tool is that it reads the output of the first sort only once, line by line, without keeping much data in memory: it only needs memory for the current line and for the top $N lines, which it then prints.
Here is the source, quickselect.cpp:

    #include <iostream>
    #include <string>
    #include <map>
    #include <cstdlib>
    #include <cassert>

    // the top-N candidates, ordered by count, largest first
    typedef std::multimap< std::size_t, std::string, std::greater< std::size_t > > winner_t;
    winner_t winner;
    std::size_t max;

    void insert( int count, const std::string& line )
    {
        if( count == 0 ) return; // skip the initial empty 'last' from the first iteration
        winner.insert( winner_t::value_type( count, line ) );
        if( winner.size() > max )
            winner.erase( --winner.end() ); // drop the current minimum
    }

    int main( int argc, char** argv )
    {
        assert( argc == 2 );
        max = std::atol( argv[1] ); // N, the number of lines to keep
        assert( max > 0 );

        std::string current, last;
        std::size_t count = 0;
        while( std::getline( std::cin, current ) )
        {
            if( current != last ) // input is sorted, so a change ends a run
            {
                insert( count, last );
                count = 1;
                last = current;
            }
            else
                ++count;
        }
        if( count )
            insert( count, current ); // flush the final run

        for( winner_t::iterator it = winner.begin(); it != winner.end(); ++it )
            std::cout << it->first << " " << it->second << std::endl;
    }

Compile it with:

    g++ -O3 quickselect.cpp -o quickselect

Yes, I do realize you asked for an out-of-the-box solution, but I don't know anything equally efficient. And the above is so simple that there is hardly any margin for error (given that you don't mess up the single numeric command-line parameter :)
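Plugged into the pipelines from the question, it would look like this (a sketch: `test` and the sed patterns are the same placeholders as above, and the freshly built binary is assumed to sit in the current directory):

    N=6; cat test|awk '{print $1" "$2" ("$3")"}'\
    |sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
    |sort|./quickselect $N

Note the design: the std::multimap ordered by std::greater keeps the candidates sorted by count, and erasing --winner.end() whenever the size exceeds N discards the current minimum, so memory stays bounded at N lines regardless of input size.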