Speeding up Python file processing for a huge dataset
I have a large dataset stored as a 17 GB csv file (fileData), containing a variable number of records for each customer_id (up to about 30,000). I am trying to search for specific customers (about 1,500 out of 90,000 in total, listed in fileSelection) and copy each customer's records into a separate csv file (fileOutput).
I am new to Python, but I'm using it because vba and matlab (which I know better) can't handle the file size. (I'm writing the code in Aptana Studio, but running Python directly from the cmd line for speed. Running 64-bit Windows 7.)

The code I wrote does extract some of the customers, but it has two problems (described after the code). Here is the code:

```python
def main():

    # Initialisation:
    #  - identify columns in selection file
    fS = open(fileSelection, "r")
    if fS.mode == "r":
        header = fS.readline()
    selheaderlist = header.split(",")
    custkey = selheaderlist.index('CUSTOMER_KEY')

    # Identify columns in dataset file
    fileData = path2 + file_data
    fD = open(fileData, "r")
    if fD.mode == "r":
        header = fD.readline()
    dataheaderlist = header.split(",")
    custID = dataheaderlist.index('CUSTOMER_ID')
    fD.close()

    # For each customer in the selection file
    customercount = 1
    for sr in fS:

        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]

        # Look for required customer in dataset
        found = 0
        fD = open(fileData, "r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr:
                    break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1

                    # Open output file
                    fileOutput = path3 + file_out_root + str(requiredcustomer) + ".csv"
                    fO = open(fileOutput, "w+")
                    fO.write(str(header))

                    # Copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")

                    # Close output file
                    fO.close()

        if found == 1:
            print("Customer Count " + str(customercount) + " Customer ID " + str(requiredcustomer) + " copied. ")
            customercount = customercount + 1
        else:
            print("Customer ID " + str(requiredcustomer) + " not found in dataset")
            fL.write(str(requiredcustomer) + "," + "NOT FOUND")
        fD.close()

    fS.close()
```

It takes several days to extract a few hundred customers, and then it fails to find any more.

[Sample output screenshot omitted]

Update: Thanks @Paul Cornelius. This is much more efficient. I took your approach and also used the csv handling suggested by @Bernardo:

```python
# Import modules
import csv

def main():

    # Initialisation:
    fileSelection = path1 + file_selection
    fileData = path2 + file_data

    # Step through selection file and create dictionary with required IDs as keys, and empty objects
    with open(fileSelection, 'rb') as csvfile:
        selected_IDs = csv.reader(csvfile)
        ID_dict = {}
        for row in selected_IDs:
            ID_dict.update({row[1]: []})

    # Step through data file: for selected customer IDs, append records to dictionary objects
    with open(fileData, 'rb') as csvfile:
        dataset = csv.reader(csvfile)
        for row in dataset:
            if row[0] in ID_dict:
                ID_dict[row[0]].extend([row[1] + ',' + row[4]])

    # Write all dictionary objects to csv files
    for row in ID_dict.keys():
        fileOutput = path3 + file_out_root + row + '.csv'
        with open(fileOutput, 'wb') as csvfile:
            output = csv.writer(csvfile, delimiter='\n')
            output.writerows([ID_dict[row]])
```
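A side note on the last step above: passing `delimiter='\n'` to `csv.writer` so that each pre-joined string lands on its own line works, but the more conventional pattern is to store each record as a list of fields and write it with `writerows`. A minimal sketch of that variant, using Python 3 conventions and assuming `ID_dict` maps each customer ID to a list of field lists (e.g. built with `ID_dict[row[0]].append([row[1], row[4]])`); `path3` and `file_out_root` are the same variables as above:

```python
import csv

def write_customer_files(ID_dict, path3, file_out_root):
    # Assumes records were collected as lists of fields, not
    # pre-joined strings, in the data-file pass above.
    for cust_id, records in ID_dict.items():
        with open(path3 + file_out_root + cust_id + '.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerows(records)  # one record per line, comma-separated fields
```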
Solution

The task is too involved for a simple answer. But your approach is very inefficient because you have too many nested loops. Try making a single pass through the list of customers, and for each one build a "customer" object containing whatever information you will need to use later. You put these in a dictionary; the keys are the distinct required-customer identifiers and the values are the customer objects. If I were you, I would get this part working first, before ever fooling around with the big file.
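As a minimal sketch of that first step (assuming the selection file's header names the column 'CUSTOMER_KEY' as in the question, and leaving the dictionary values as empty lists to hold whatever per-customer information is needed later):

```python
import csv

def load_selected_customers(fileSelection):
    # One pass over the selection file; keys are the required
    # customer IDs, values are placeholders for later use.
    with open(fileSelection, newline='') as f:
        reader = csv.DictReader(f)
        return {row['CUSTOMER_KEY']: [] for row in reader}
```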
Now you step through the massive customer data file just once, and each time you encounter a record whose datarecord[custID] field is in the dictionary, you append a line to the output file. You can use the relatively efficient `in` operator to test for membership in the dictionary. No nested loops are needed.

The code as you present it can't run, since you write to some object named fL without ever opening it. Also, as Tim Pietzcker pointed out, you aren't closing your files, since you never actually call the close function.
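Putting the two steps together, here is a sketch of that single pass. It is hedged on two assumptions beyond the question itself: the big file's header names the column 'CUSTOMER_ID' (as in the question's code), and the OS allows roughly 1,500 simultaneously open output files; if the handle limit is a problem, buffer the matching rows in the dictionary instead, as in the question's update.

```python
import csv

def split_by_customer(fileData, selected, path3, file_out_root):
    writers = {}  # customer ID -> (csv writer, underlying file handle)
    with open(fileData, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        cust_col = header.index('CUSTOMER_ID')
        for row in reader:
            cust = row[cust_col]
            if cust in selected:               # O(1) dict membership test
                if cust not in writers:
                    out = open(path3 + file_out_root + cust + '.csv',
                               'w', newline='')
                    w = csv.writer(out)
                    w.writerow(header)         # repeat the header in each file
                    writers[cust] = (w, out)
                writers[cust][0].writerow(row)
    for _, out in writers.values():
        out.close()                            # explicitly close every output file
```

Keeping one open handle per selected customer avoids holding all the matches in memory; the trade-off is the handle count, which is why the in-memory dictionary from the question's update is the safer default.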