
python – Converting a ragged dictionary of lists to a pandas DataFrame

Published: 2020-12-16 22:50:43 | Category: Python | Source: collected from the web

(Or a list of lists… I just edited this.)

Is there an existing python / pandas method for converting a structure like this

food2 = {}
food2["apple"]   = ["fruit","round"]
food2["bananna"] = ["fruit","yellow","long"]
food2["carrot"]  = ["veg","orange","long"]
food2["raddish"] = ["veg","red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+
|         | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple   | 1     |     | 1     |      |        |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1     |     |       | 1    | 1      |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot  |       | 1   |       | 1    |        | 1      |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish |       | 1   |       |      |        |        | 1   |
+---------+-------+-----+-------+------+--------+--------+-----+

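Naively, I would probably just loop over the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the dictionary. Once they are joined, I could use pandas.pivot_table.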

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # pair the key with each of its attributes
    onefruit_pairs = [[key, x] for x in attrlist]
    one_fruit_frame = pd.DataFrame(onefruit_pairs,columns=['fruit','attr'])
    print(one_fruit_frame)

     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit attr
0  raddish  veg
1  raddish  red
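
The missing join/stack step could be sketched roughly as follows. This is only an illustration, assuming `pd.concat` is used to stack the per-key frames and `pd.crosstab` stands in for the pivot step; the names `frames`, `long_frame` and `table` are made up for this example.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# One small (fruit, attr) frame per key; the scalar key is broadcast
# against the list of attributes.
frames = [pd.DataFrame({"fruit": key, "attr": attrs}) for key, attrs in food2.items()]

# Stack the per-key frames into a single long frame.
long_frame = pd.concat(frames, ignore_index=True)

# Count each (fruit, attr) pair; pairs that never occur become 0.
table = pd.crosstab(long_frame["fruit"], long_frame["attr"])
print(table)
```

Here `table` has one row per key and one column per distinct attribute, with 1 where the attribute is present and 0 elsewhere.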
Best answer
Pure Python:

from itertools import chain

def count(d):
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row,values in d.items():
        yield [row] + [(col in values) for col in cols]

Test:

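>>> food2 = {
...     "apple": ["fruit","round"],
...     "bananna": ["fruit","yellow","long"],
...     "carrot": ["veg","orange","long"],
...     "raddish": ["veg","red"],
... }

>>> list(count(food2))
[['name','long','veg','fruit','yellow','orange','round','red'],
 ['bananna',True,False,True,True,False,False,False],
 ['carrot',True,True,False,False,True,False,False],
 ['apple',False,False,True,False,False,True,False],
 ['raddish',False,True,False,False,False,False,True]]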
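
Since the question asks for a pandas DataFrame, the rows produced by `count` could be loaded into one along these lines. This is a minimal sketch, not part of the original answer: it assumes the first yielded row is used as the header and that `True`/`False` should be shown as 1/0; the variable names `rows`, `header` and `df` are illustrative.

```python
from itertools import chain

import pandas as pd


def count(d):
    # Same generator as in the answer: header row first, then one row per key.
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = count(food2)
header = next(rows)  # the first yielded row holds the column names

# Remaining rows become the data; 'name' becomes the index.
df = pd.DataFrame(list(rows), columns=header).set_index("name")

# Convert booleans to 1/0 to resemble the table in the question.
df = df.astype(int)
print(df)
```

The column order follows the iteration order of the set inside `count`, so it can differ between runs.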

[Update]

Performance test:

>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
...     sample[k] = random.sample(attrs,5)
>>> import time
>>> n = time.time(); list(count(sample)); print(time.time() - n)
62.0367980003

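This took under 2 minutes (about 62 seconds) on my busy machine, with lots of Chrome tabs open, for 279936 rows by 1000 columns. Let me know if the performance is unacceptable.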
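
A scaled-down version of this timing run, written for Python 3, might look like the sketch below. The smaller label set (6**3 labels, 50 attributes), the fixed random seed and the use of `time.perf_counter` are assumptions made for illustration, so the elapsed time is not comparable to the figure reported above.

```python
import random
import time
from itertools import chain, product


def count(d):
    # Same generator as in the answer: header row first, then one row per key.
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


random.seed(0)  # assumption: fixed seed so repeated runs build the same sample

# 6**3 = 216 three-letter labels instead of the 6**7 used in the answer.
labels = ["".join(p) for p in product("ABCDEF", repeat=3)]
attrs = labels[:50]

# Each label gets 5 distinct attributes drawn from attrs.
sample = {k: random.sample(attrs, 5) for k in labels}

start = time.perf_counter()
result = list(count(sample))
elapsed = time.perf_counter() - start

print(len(result) - 1, "rows x", len(result[0]) - 1, "columns in", elapsed, "seconds")
```

Raising the `repeat` value and the slice size moves the run toward the dimensions used in the original test.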

[Update]

Testing the performance of the approach from another answer:

>>> n = time.time()
>>> df = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in sample.items()]))
>>> print(time.time() - n)
72.0512290001

The next line (df = pd.melt(…)) took too long, so I cancelled the test. Take this result with a grain of salt, since it ran on a busy machine.
