Need to compare very large files of around 1.5GB in Python
"DF","00000000@22222.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2" "Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","2025" "DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","6792" "Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","800" "Rail","0000.ANU@GMAIL.COM","NR251764697526","595" "Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","957" "Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","212" "DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","17080" "Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","5731" "DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","2000" "DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","4006" "DF","NF251742087846","12DEC2010","1000" "DF","NF252022031180","09DEC2010","3439" "Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","136" "Rail","NR2151213260036","28NOV2012","41" "Rail","NR2151313264432","29NOV2012","96" "Rail","NR2151413266728","NR2512912359037","08DEC2012","NR2517612385569","12DEC2012","96" 以上是样本数据. 我希望在另一个csv文件中输出这样的东西 "DF","6799.2",1,0 days "Rail","2025",0 days "DF","6792",0 days "Bus","800","595","957","212","17080","5731","2000","4006","1000",2,3 days "DF","3439",3,10 days "Rail","41","96",1 days "Rail",4,9 days "Rail",5,6,4 days "Rail",7,"136",8,44 days "Rail",9,0 days 即如果第一次进入,我需要追加1如果它发生第二次我需要追加2同样我的意思是我需要计算文件中的电子邮件地址的出现次数,如果电子邮件存在两次或更多我想要区别日期和记住日期之间没有排序所以我们必须根据特定的电子邮件地址对它们进行排序,我正在寻找python中的解决方案,使用numpy或pandas库或任何其他可以处理这种类型的大数据的库而不放弃绑定内存异常我有双核处理器与centos 6.3和4GB的内存 解决方法
Another possible (sysadmin) way, which avoids a database and SQL queries as well as heavy demands on runtime processes and hardware resources.
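Update 20/04: added more code and a simplified approach:

1. Convert the timestamp to seconds (since the Epoch) and use UNIX sort on the email and this new field (that is: sort -k2 -k4 -n -t, < converted_input_file > output_file)

a) If it is the same email with a different timestamp: calculate the days, increment COUNT, update PREV_TIME, and append "Count, Difference_in_days"

An alternative to 1. is to add a new field TIMESTAMP and remove it when printing out the line.

Note: if 1.5GB is too large to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.

    /usr/bin/gawk -F'","' '
    BEGIN {
        # Map month abbreviations to month numbers once, before reading input.
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
        for (i = 1; i <= 12; i++) mdigit[month[i]] = i
    }
    {
        # $4 is the date as DDMONYYYY (e.g. 26JUL2010); mktime expects "YYYY MM DD HH MM SS".
        print $0 "," mktime(substr($4, 6, 4) " " mdigit[substr($4, 3, 3)] " " substr($4, 1, 2) " 00 00 00")
    }' < input.txt | /usr/bin/sort -t, -k2,2 -k7,7n > output_file.txt

The sort keys are limited to the email field (-k2,2) and the appended timestamp, compared numerically (-k7,7n). Field 7 assumes six comma-separated fields per input record, as in the first sample row. The sorted result is written to output_file.txt.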
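Since the question asks for Python, the timestamp conversion can also be sketched there. This is only a sketch under stated assumptions, not the answer's own code: the names `to_epoch` and `with_timestamp` are hypothetical, the date is assumed to be the fourth field in DDMONYYYY form, and midnight UTC is used where gawk's mktime would use local time. Rows with a missing or malformed date are not handled. The in-memory `sorted` call is for the tiny demo only; for the full 1.5GB file the external UNIX sort remains the safer choice.

```python
import calendar
import csv
from datetime import datetime

# Month abbreviations as they appear in the sample data (upper case).
MONTHS = dict((name, number) for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), 1))


def to_epoch(date_text):
    """Convert a DDMONYYYY string (e.g. 26JUL2010) to seconds since the Epoch, UTC."""
    day = int(date_text[0:2])
    month = MONTHS[date_text[2:5]]
    year = int(date_text[5:9])
    return calendar.timegm(datetime(year, month, day).timetuple())


def with_timestamp(rows, date_col=3):
    """Yield each row with its epoch timestamp appended as a new last field.

    date_col=3 (the fourth field) is an assumption taken from the sample rows.
    """
    for row in rows:
        yield list(row) + [to_epoch(row[date_col])]


# Demo on two rows copied from the sample data above.
sample_lines = [
    '"DF","00000000@22222.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"',
    '"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","2025"',
]
converted = list(with_timestamp(csv.reader(sample_lines)))

# Sort by email (second field), then by the appended timestamp (last field).
for row in sorted(converted, key=lambda r: (r[1], r[-1])):
    print(row)
```

Because the generator handles one row at a time, memory use stays flat during conversion; only the sorting step needs the whole data set, which is why it is left to an external tool for large inputs.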
You then pipe the output to a Perl, Python, or AWK script to handle steps 2 through 4.
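The per-email pass over the sorted data might look like the following in Python. It is a minimal sketch that assumes the input is already sorted by email and then by timestamp, and that the epoch timestamp is the last field of every row. The function name `annotate` and the variable `email` are hypothetical; `count` and `prev_time` mirror the COUNT and PREV_TIME mentioned above. The first row for an email is given a count of 1 and "0 days"; each later row for the same email increments the count and reports whole days since the previous row.

```python
SECONDS_PER_DAY = 86400


def annotate(rows, email_col=1, ts_col=-1):
    """Yield rows without the timestamp field, with count and day gap appended.

    Assumes rows arrive sorted by email, then by timestamp (epoch seconds).
    """
    email = None    # email address of the group currently being read
    prev_time = 0   # timestamp of the previous row in that group
    count = 0       # occurrences of the current email seen so far
    for row in rows:
        timestamp = int(row[ts_col])
        if row[email_col] != email:
            # A new email starts a new group: count restarts, gap is zero.
            email = row[email_col]
            count = 1
            days = 0
        else:
            # Same email again: one more occurrence, gap since previous row.
            count += 1
            days = (timestamp - prev_time) // SECONDS_PER_DAY
        prev_time = timestamp
        out = list(row)
        del out[ts_col]  # drop the helper timestamp before output
        yield out + [count, "%d days" % days]


# Demo with made-up rows (type, email, id, epoch seconds), already sorted.
demo_rows = [
    ["DF", "a@example.com", "ID1", "0"],
    ["DF", "a@example.com", "ID2", "259200"],
    ["Bus", "b@example.com", "ID3", "100"],
]
for line in annotate(demo_rows):
    print(line)
```

To use it at the end of the pipeline, one option is to feed it `csv.reader(sys.stdin)` and pass the result to `csv.writer(sys.stdout).writerows(...)`. Only one row is held at a time, so memory use does not grow with the file size.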