python - Sparse vectors in PySpark

Published: 2020-12-20 12:12:36 Category: Python Source: compiled from the web

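I would like to find an efficient way to create sparse vectors in PySpark using DataFrames.

Let's say we are given the following transactional input: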

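df = spark.createDataFrame([
    (0, "a"), (1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"),
    (2, "b"), (2, "b"), (2, "c"), (0, "a"), (1, "b"), (1, "b"),
    (2, "cc"), (3, "a"), (4, "a"), (5, "c"),
], ["id", "category"])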

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       a|
|  1|       b|
|  1|       c|
|  2|       a|
|  2|       b|
|  2|       b|
|  2|       b|
|  2|       c|
|  0|       a|
|  1|       b|
|  1|       b|
|  2|      cc|
|  3|       a|
|  4|       a|
|  5|       c|
+---+--------+

In summarized form:

df.groupBy(df["id"],df["category"]).count().show()

+---+--------+-----+
| id|category|count|
+---+--------+-----+
|  1|       b|    3|
|  1|       a|    1|
|  1|       c|    1|
|  2|      cc|    1|
|  2|       c|    1|
|  2|       a|    1|
|  2|       b|    3|
|  0|       a|    2|
|  3|       a|    1|
|  4|       a|    1|
|  5|       c|    1|
+---+--------+-----+

My goal is to get this output for each id:

+---+-----------------------------------------------+
| id|                                        feature|
+---+-----------------------------------------------+
|  2|SparseVector({a: 1.0, b: 3.0, c: 1.0, cc: 1.0})|
+---+-----------------------------------------------+

Could you point me in the right direction? With MapReduce in Java this seemed much easier to me.

Solution

This can be done quite easily with pivot and VectorAssembler. First, replace the plain aggregation with a pivot:

pivoted = df.groupBy("id").pivot("category").count().na.fill(0)
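
To make the pivot step concrete, here is a minimal plain-Python sketch of the same aggregation, with no Spark involved. The helper name pivot_counts is made up for illustration; it only mirrors what groupBy("id").pivot("category").count().na.fill(0) is expected to compute on the example data, assuming the pivot columns come out in sorted order.

```python
from collections import Counter, defaultdict

# Same (id, category) pairs as the example DataFrame.
rows = [
    (0, "a"), (1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"),
    (2, "b"), (2, "b"), (2, "c"), (0, "a"), (1, "b"), (1, "b"),
    (2, "cc"), (3, "a"), (4, "a"), (5, "c"),
]


def pivot_counts(rows):
    """Hypothetical helper: return (sorted categories, {id: counts per category})."""
    counts = defaultdict(Counter)
    for row_id, category in rows:
        counts[row_id][category] += 1
    categories = sorted({category for _, category in rows})
    # A Counter returns 0 for missing keys, which plays the role of na.fill(0).
    table = {
        row_id: [counter[c] for c in categories]
        for row_id, counter in sorted(counts.items())
    }
    return categories, table


categories, table = pivot_counts(rows)
print(categories)
for row_id, values in table.items():
    print(row_id, values)
```

Each printed row corresponds to one line of the pivoted DataFrame: one count per category, with zeros where an id never saw that category.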

Then assemble the pivoted columns into a single vector column:

from pyspark.ml.feature import VectorAssembler

# Every pivoted column except the "id" key becomes a feature.
input_cols = [x for x in pivoted.columns if x != "id"]

result = (VectorAssembler(inputCols=input_cols, outputCol="features")
    .transform(pivoted)
    .select("id", "features"))

The result is shown below; the feature order is a, b, c, cc. For each row, VectorAssembler picks the more compact representation (sparse or dense) depending on sparsity:

+---+-----------------+
|id |features         |
+---+-----------------+
|0  |(4,[0],[2.0])    |
|5  |(4,[2],[1.0])    |
|1  |[1.0,3.0,1.0,0.0]|
|3  |(4,[0],[1.0])    |
|2  |[1.0,3.0,1.0,1.0]|
|4  |(4,[0],[1.0])    |
+---+-----------------+
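
Why some rows come out sparse and others dense can be sketched with a size comparison. The byte estimates below are an assumption based on my reading of the compressed method of Spark's vectors, which VectorAssembler appears to apply to its output; check the source of your Spark version before relying on the exact constants. The function names are hypothetical.

```python
def dense_bytes(size):
    # Assumed estimate: 8 bytes per value plus 8 bytes of overhead.
    return 8 * size + 8


def sparse_bytes(nnz):
    # Assumed estimate: 12 bytes per nonzero entry
    # (4-byte index + 8-byte value) plus 20 bytes of overhead.
    return 12 * nnz + 20


def prefers_sparse(values):
    """Return True when the sparse layout is estimated to be smaller."""
    nnz = sum(1 for v in values if v != 0)
    return sparse_bytes(nnz) < dense_bytes(len(values))


print(prefers_sparse([2.0, 0.0, 0.0, 0.0]))  # one nonzero out of four
print(prefers_sparse([1.0, 3.0, 1.0, 0.0]))  # three nonzeros out of four
```

Under these assumptions, a vector with few nonzero entries relative to its length is kept sparse, and a mostly filled vector is stored dense.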

If you need a single representation, you can still convert every vector to sparse form:

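from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf
import numpy as np

def to_sparse(c):
    def to_sparse_(v):
        # Already sparse: return it unchanged.
        if isinstance(v, SparseVector):
            return v
        # Dense: keep only the nonzero entries and their positions.
        vs = v.toArray()
        nonzero = np.nonzero(vs)[0]
        return SparseVector(v.size, nonzero, vs[nonzero])
    return udf(to_sparse_, VectorUDT())(c)

result.withColumn("features", to_sparse(result["features"])).show(truncate=False)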

+---+-------------------------------+
|id |features                       |
+---+-------------------------------+
|0  |(4,[0],[2.0])                  |
|5  |(4,[2],[1.0])                  |
|1  |(4,[0,1,2],[1.0,3.0,1.0])      |
|3  |(4,[0],[1.0])                  |
|2  |(4,[0,1,2,3],[1.0,3.0,1.0,1.0])|
|4  |(4,[0],[1.0])                  |
+---+-------------------------------+
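
The dense-to-sparse conversion itself is simple enough to sketch without Spark or NumPy. The two helpers below are hypothetical and work on plain lists; they only illustrate the (size, indices, values) layout that the sparse rows are printed in.

```python
def dense_to_sparse(values):
    """Return (size, indices, nonzero_values) for a dense list of numbers."""
    indices = [i for i, v in enumerate(values) if v != 0]
    return len(values), indices, [values[i] for i in indices]


def sparse_to_dense(size, indices, nonzero_values):
    """Rebuild the dense list from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, nonzero_values):
        dense[i] = v
    return dense


size, indices, vals = dense_to_sparse([1.0, 3.0, 1.0, 0.0])
print((size, indices, vals))
print(sparse_to_dense(size, indices, vals))
```

Round-tripping through both helpers returns the original list, which is a quick way to check that no information is lost in the sparse form.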

(Editor: 李大同)
