Optimizing an algorithm in Python for building a list of items rated together
Given a list of purchase events (customer_id, item)
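    1-hammer
    1-screwdriver
    1-nails
    2-hammer
    2-nails
    3-screws
    3-screwdriver
    4-nails
    4-screws

I'm trying to build a data structure that tells me how many times an item was bought with another item. Not bought at the same time, but bought by the same customer at any point since I started saving data. The result would look like

    {
        hammer      : {screwdriver : 1, nails : 2},
        screwdriver : {hammer : 1, screws : 1, nails : 1},
        screws      : {screwdriver : 1, nails : 1},
        nails       : {hammer : 2, screws : 1, screwdriver : 1}
    }

indicating that a hammer was bought with nails twice (customers 1 and 2) and with a screwdriver once (customer 1), screws were bought with a screwdriver once (customer 3), and so on...

My current approach is

users = a dict where the user id is the key and the list of items purchased is the value
usersForItem = a dict where the item is the key and the list of users who purchased it is the value

pseudo:

    for each event (customer, item), sorted by item:
        add the user to the users dict if not already there, and add the item
        add the item to the items dict if not already there, and add the user

----------

    users = {}          # user id -> list of items bought
    usersForItem = {}   # item -> list of users who bought it
    items = []
    userlist = []
    last_item = None

    # rows is assumed to hold (item, user) pairs sorted by item
    for item, user in rows:
        # add the user to the users dict if they don't already exist.
        users[user] = users.get(user, [])
        # append the current item_id to the list of items rated by the current user
        users[user].append(item)

        if item != last_item:
            # we just started a new item, which means we just finished processing an item;
            # write the userlist for the last item to the usersForItem dictionary.
            if last_item is not None:
                usersForItem[last_item] = userlist
            userlist = [user]
            last_item = item
            items.append(item)
        else:
            userlist.append(user)

    usersForItem[last_item] = userlist

So, at this point, I have two dicts: who bought what, and what was bought by whom. This is where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I admit this may not be the most Pythonic way of doing things; I'm trying to make sure I get the right result (which I do) before making it Pythonic.

    relatedItems = {}
    for key, listOfUsers in usersForItem.items():
        relatedItems[key] = {}
        related = []
        for ux in listOfUsers:
            for itemRead in users[ux]:
                if itemRead != key:
                    if itemRead not in related:
                        related.append(itemRead)
                    relatedItems[key][itemRead] = relatedItems[key].get(itemRead, 0) + 1
        # calc jaccard/tanimoto similarity between relatedItems[key] and its values

Is there a more efficient way to do this? Additionally, if there is a proper academic name for this type of operation, I'd love to hear it.

Edit: clarified to include the fact that I'm not restricting this to items bought together at the same time. Items can be bought at any time.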
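To make the similarity step mentioned above concrete, here is a minimal sketch of a Jaccard (Tanimoto) similarity between two items, computed on the sets of customers who bought each one. The function name `jaccard` and the `buyers` dict are illustrative assumptions that mirror `usersForItem` for the sample events; they are not part of the code in the question.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical item -> buyers mapping for the sample events above.
buyers = {
    "hammer": [1, 2],
    "screwdriver": [1, 3],
    "nails": [1, 2, 4],
    "screws": [3, 4],
}

# hammer and nails share customers 1 and 2 out of {1, 2, 4}
hammer_nails = jaccard(buyers["hammer"], buyers["nails"])
# hammer and screws share no customers
hammer_screws = jaccard(buyers["hammer"], buyers["screws"])
```

Working on sets of customers rather than on lists keeps the intersection and union steps close to constant time per element.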
Solution

    from collections import defaultdict
    from itertools import groupby

    events = """
    1-hammer
    1-screwdriver
    1-nails
    2-hammer
    2-nails
    3-screws
    3-screwdriver
    4-nails
    4-screws""".splitlines()

    # split each non-empty "id-item" line into an (id, item) tuple, then sort by id
    events = sorted(tuple(part.strip() for part in e.split('-'))
                    for e in events if e.strip())

    # tally each occurrence of each pair of items
    summary = defaultdict(int)
    for val, items in groupby(events, key=lambda x: x[0]):
        items = sorted(it[1] for it in items)
        for i, item1 in enumerate(items):
            for item2 in items[i+1:]:
                summary[(item1, item2)] += 1
                summary[(item2, item1)] += 1

    # now convert raw pair counts into friendlier lookup table
    pairmap = defaultdict(dict)
    for k, v in summary.items():
        item1, item2 = k
        pairmap[item1][item2] = v

    # print the results
    for k, v in sorted(pairmap.items()):
        print(k, ':', v)

Gives:

    hammer : {'nails': 2, 'screwdriver': 1}
    nails : {'hammer': 2, 'screwdriver': 1, 'screws': 1}
    screwdriver : {'hammer': 1, 'nails': 1, 'screws': 1}
    screws : {'screwdriver': 1, 'nails': 1}

(This addresses your initial request, grouping items by purchase event. To group by user, just change the first key of the events list from event number to user id.)
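A variation on the same idea, offered as a sketch rather than as the answer author's method: group items per customer into sets, then count unordered pairs with `collections.Counter` and `itertools.combinations`. This kind of tally is commonly called a co-occurrence count. The function name `co_purchase_counts` and the tuple-based `events` list are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_counts(events):
    """Return {item: {other_item: number of customers who bought both}}."""
    # Collect each customer's items in a set, so repeat purchases count once.
    baskets = defaultdict(set)
    for customer, item in events:
        baskets[customer].add(item)

    # Count each unordered pair once per customer.
    pair_counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            pair_counts[a, b] += 1

    # Mirror the counts so that lookups work in both directions.
    related = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        related[a][b] = n
        related[b][a] = n
    return {item: dict(others) for item, others in related.items()}


events = [
    (1, "hammer"), (1, "screwdriver"), (1, "nails"),
    (2, "hammer"), (2, "nails"),
    (3, "screws"), (3, "screwdriver"),
    (4, "nails"), (4, "screws"),
]
related = co_purchase_counts(events)
```

Because the events are grouped in a dict, the input does not need to be sorted first. Using sets is a design choice: a customer who buys the same item twice still contributes one count per pair, which may or may not match what you want.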