python – One-hot encoding only for frequent values

Published: 2020-12-20 12:00:49 | Category: Python | Source: compiled from the web


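I want to one-hot encode a column, but only for the values that occur frequently. Every value whose count falls below a threshold T should be lumped into a category of its own.

My strategy is to build a dictionary mapping "name" -> "frequency", then convert the frequencies to strings. If a string is uncommon, it should be replaced with some descriptive string. Preferably, I would like two tiers/thresholds: "less_common" and "rare", or something similar.

Here is my current attempt. I split it into several lines just for debugging, FYI. Line 3 does not work. I'm using conda with Python 3.6.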

tmp = df["name"].groupby(df["name"])
tmp = tmp.agg(['count'])
tmp['count'] = tmp["count"].apply(lambda x: "Uncommon" if tmp["count"] < 1000.0 else str(x) )
labelDict = tmp.to_dict()
#some code?
df[columnName].replace(labelDict,inplace=True)
pd.get_dummies(df,columns=['name'])

Error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
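
The error points at line 3: inside the lambda, `tmp["count"] < 1000.0` compares the whole Series instead of the single value `x`, so the `if` is handed a Series of booleans. A minimal sketch of the same step with the comparison made on `x` follows. The sample data, the threshold of 2, and the names `counts`, `flagged`, and `label_dict` are assumptions made for illustration, not part of the original attempt.

```python
import pandas as pd

# Hypothetical sample data standing in for the real "name" column.
df = pd.DataFrame({"name": list("aaaabbbccd")})

threshold = 2  # assumed value of T for this sketch
counts = df["name"].value_counts()  # name -> frequency

# Compare the scalar x (one count), not the whole Series.
flagged = counts.apply(lambda x: "Uncommon" if x <= threshold else str(x))

# Keep the original name wherever the count survived as a number.
label_dict = {
    name: ("Uncommon" if value == "Uncommon" else name)
    for name, value in flagged.items()
}

# Map each row to its label, then one-hot encode the result.
encoded = pd.get_dummies(df["name"].map(label_dict), prefix="name")
```

Here `label_dict` maps `a` and `b` to themselves and `c` and `d` to `"Uncommon"`, which matches the remapped dictionary described in the question.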

Some sample input (there are other columns as well):
name = a,a,a,a,b,b,b,c,c,d

This becomes

name | count
a | 4
b | 3
c | 2
d | 1

Let's say T is <= 2
dict:
a->4,b->3,c->"Uncommon",d->"Uncommon"

Remap dict to use the original names where the value is numeric:
a->"a",b->"b",c->"Uncommon",d->"Uncommon"

As one hot:
date | id | name_a | name_b | name_Uncommon 
...  | ...|  1     | 0      | 0
...

Current library versions:

['alabaster==0.7.10','anaconda-client==1.6.3','anaconda-navigator==1.6.2','anaconda-project==0.6.0','appnope==0.1.0','appscript==1.0.1','asn1crypto==0.22.0','astroid==1.4.9','astropy==1.3.2','babel==2.4.0','backports.shutil-get-terminal-size==1.0.0','beautifulsoup4==4.6.0','bitarray==0.8.1','blaze==0.10.1','bleach==1.5.0','bokeh==0.12.5','boto==2.46.1','bottleneck==1.2.1','branca==0.2.0','cffi==1.10.0','chardet==3.0.3','chest==0.2.3','click==6.7','cloudpickle==0.2.2','clyent==1.2.2','colorama==0.3.9','conda==4.3.21','configobj==5.0.6','contextlib2==0.5.5','cryptography==1.8.1','cycler==0.10.0','cython==0.25.2','cytoolz==0.8.2','dask==0.14.3','datashape==0.5.4','decorator==4.0.11','dill==0.2.6','distributed==1.16.3','docutils==0.13.1','entrypoints==0.2.2','et-xmlfile==1.0.1','fastcache==1.0.2','flask-cors==3.0.2','flask==0.12.2','folium==0.3.0','gevent==1.2.1','greenlet==0.4.12','h5py==2.7.0','heapdict==1.0.0','html5lib==0.9999999','idna==2.5','imagesize==0.7.1','ipykernel==4.6.1','ipython-genutils==0.2.0','ipython==5.3.0','ipywidgets==6.0.0','isort==4.2.5','itsdangerous==0.24','jdcal==1.3','jedi==0.10.2','jinja2==2.9.6','jsonschema==2.6.0','jupyter-client==5.0.1','jupyter-console==5.1.0','jupyter-core==4.3.0','jupyter==1.0.0','keras==2.0.4','lazy-object-proxy==1.2.2','llvmlite==0.18.0','locket==0.2.0','lxml==3.7.3','mako==1.0.6','markdown==2.2.0','markupsafe==0.23','matplotlib==2.0.2','mistune==0.7.4','mpmath==0.19','msgpack-python==0.4.8','multipledispatch==0.4.9','navigator-updater==0.1.0','nbconvert==5.1.1','nbformat==4.3.0','networkx==1.11','nltk==3.2.3','nose==1.3.7','notebook==5.0.0','numba==0.33.0','numexpr==2.6.2','numpy==1.12.1','numpydoc==0.6.0','odo==0.5.0','olefile==0.44','openpyxl==2.4.7','packaging==16.8','pandas==0.20.1','pandocfilters==1.4.1','partd==0.3.8','pathlib2==2.2.1','patsy==0.4.1','pep8==1.7.0','pexpect==4.2.1','pickleshare==0.7.4','pillow==4.1.1','pip==9.0.1','ply==3.10','prompt-toolkit==1.0.14','protobuf==3.3.0','psutil==5.2.2','ptyprocess==0.5.1','py==1.4.33','pyasn1==0.2.3','pycosat==0.6.2','pycparser==2.17','pycrypto==2.6.1','pycurl==7.43.0','pyflakes==1.5.0','pygments==2.2.0','pygpu==0.6.4','pylint==1.6.4','pyodbc==4.0.16','pyopenssl==17.0.0','pyparsing==2.1.4','pytest==3.0.7','python-dateutil==2.6.0','pytz==2017.2','pywavelets==0.5.2','pyyaml==3.12','pyzmq==16.0.2','qtawesome==0.4.4','qtconsole==4.3.0','qtpy==1.2.1','redis==2.10.5','requests==2.14.2','rope-py3k==0.9.4.post1','scikit-image==0.13.0','scikit-learn==0.18.1','scipy==0.19.0','seaborn==0.7.1','setuptools==27.2.0','simplegeneric==0.8.1','singledispatch==3.4.0.3','six==1.10.0','snowballstemmer==1.2.1','sockjs-tornado==1.0.3','sortedcollections==0.5.3','sortedcontainers==1.5.7','sphinx==1.5.6','spyder==3.1.4','sqlalchemy==1.1.9','statsmodels==0.8.0','sympy==1.0','tables==3.3.0','tblib==1.3.2','tensorflow==1.2.0rc1','terminado==0.6','testpath==0.3','tflearn==0.3.1','theano==0.9.0','toolz==0.8.2','tornado==4.5.1','traitlets==4.3.2','unicodecsv==0.14.1','wcwidth==0.1.7','werkzeug==0.12.2','wheel==0.29.0','widgetsnbextension==2.0.0','wrapt==1.10.10','xgboost==0.6','xlrd==1.0.0','xlsxwriter==0.9.6','xlwings==0.10.4','xlwt==1.2.0','zict==0.1.2']

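I admit I found a related solution, but it is not clear to me how to modify it for my needs. The problem is that you cannot one-hot encode a "first" column with values {a,…}, then one-hot encode a "second" column that may also have values {a,…}, and label the resulting columns by value alone: I would get a name collision. The related question is: Pandas One hot encoding: Bundling together less frequent categories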
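
On the name collision: when `pd.get_dummies` is called on a whole DataFrame with `columns=`, it prefixes each dummy column with the name of the column it came from, so identical values in different columns stay distinct. A small sketch with two made-up columns (the frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Two hypothetical columns that share the value "a".
df = pd.DataFrame({
    "first": ["a", "b", "a"],
    "second": ["a", "c", "c"],
})

# Each dummy column is named <source column>_<value> by default.
dummies = pd.get_dummies(df, columns=["first", "second"])
print(list(dummies.columns))
```

The resulting columns are `first_a`, `first_b`, `second_a`, and `second_c`, so bucketing rare values per column before encoding should not clash either, since an `"uncommon"` label would likewise come out as `first_uncommon` and `second_uncommon`.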

Solution

Consider the example dataframe df

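import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        name=np.random.choice(
            # probabilities 10/55, 9/55, ..., 1/55, so 'a' is drawn most often
            list('abcdefghij'),1000,p=np.arange(10,0,-1) / 55
        )
    ))
threshold = 60
counts = df.name.value_counts()  # frequency of each name
counts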

a    197
b    166
c    139
d    119
f    107
e    105
g     72
h     53
i     27
j     15
Name: name, dtype: int64

Then replace the infrequent values and call pd.get_dummies

repl = counts[counts <= threshold].index
print(pd.get_dummies(df.name.replace(repl,'uncommon')))

     a  b  c  d  e  f  g  uncommon
0    0  0  1  0  0  0  0         0
1    0  0  1  0  0  0  0         0
2    0  0  1  0  0  0  0         0
3    0  0  1  0  0  0  0         0
4    0  0  1  0  0  0  0         0
5    1  0  0  0  0  0  0         0
6    0  0  0  0  0  0  1         0
7    0  0  0  0  0  1  0         0
8    0  0  0  0  0  1  0         0
9    0  0  0  0  0  1  0         0
10   0  0  0  0  0  0  0         1
11   0  0  0  0  0  0  1         0
12   0  0  0  0  0  0  1         0
13   0  0  0  0  0  0  0         1
14   0  0  0  0  1  0  0         0
15   1  0  0  0  0  0  0         0
16   1  0  0  0  0  0  0         0
17   0  1  0  0  0  0  0         0
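
The question also asks for two tiers ("less_common" and "rare") rather than a single bucket. The same idea extends to that case; what follows is only a sketch. The helper `bucket_by_frequency`, its parameter names, the cutoffs, and the sample data are all hypothetical and not part of the original answer.

```python
import pandas as pd

def bucket_by_frequency(col, rare_below, less_common_below):
    """Replace infrequent values in col with tier labels.

    Values seen fewer than rare_below times become "rare"; values seen
    fewer than less_common_below times (but not rare) become "less_common".
    Assumes rare_below <= less_common_below.
    """
    # Frequency of each row's own value, aligned with col.
    freq = col.map(col.value_counts())
    out = col.where(freq >= less_common_below, "less_common")
    out = out.where(freq >= rare_below, "rare")
    return out

# Hypothetical data: a x6, b x4, c x2, d x1.
names = pd.Series(list("aaaaaabbbbccd"), name="name")
bucketed = bucket_by_frequency(names, rare_below=2, less_common_below=4)
encoded = pd.get_dummies(bucketed, prefix="name")
print(list(encoded.columns))
```

With these cutoffs, `a` and `b` keep their own columns, `c` falls into `name_less_common`, and `d` into `name_rare`. Passing `prefix=` (or calling `pd.get_dummies` on the DataFrame with `columns=`) keeps the tier labels of different source columns apart.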

(Editor: 李大同)
