python – Sparse vectors in PySpark
Published: 2020-12-20 12:12:36 Category: Python Source: compiled from the web
I'd like to find an efficient way to create sparse vectors in PySpark using DataFrames.
Let's say we are given the following transactional input:

    df = spark.createDataFrame([
        (0, "a"), (1, "a"), (1, "b"), (1, "c"),
        (2, "a"), (2, "b"), (2, "b"), (2, "b"), (2, "c"),
        (0, "a"), (1, "b"), (1, "b"), (2, "cc"),
        (3, "a"), (4, "a"), (5, "c")
    ], ["id", "category"])

    +---+--------+
    | id|category|
    +---+--------+
    |  0|       a|
    |  1|       a|
    |  1|       b|
    |  1|       c|
    |  2|       a|
    |  2|       b|
    |  2|       b|
    |  2|       b|
    |  2|       c|
    |  0|       a|
    |  1|       b|
    |  1|       b|
    |  2|      cc|
    |  3|       a|
    |  4|       a|
    |  5|       c|
    +---+--------+

In summarized form:

    df.groupBy(df["id"], df["category"]).count().show()

    +---+--------+-----+
    | id|category|count|
    +---+--------+-----+
    |  1|       b|    3|
    |  1|       a|    1|
    |  1|       c|    1|
    |  2|      cc|    1|
    |  2|       c|    1|
    |  2|       a|    1|
    |  1|       a|    1|
    |  0|       a|    2|
    +---+--------+-----+

My goal is to get this output per id:

    +---+-----------------------------------------------+
    | id|                    feature                    |
    +---+-----------------------------------------------+
    |  2|SparseVector({a: 1.0, b: 3.0, c: 1.0, cc: 1.0})|

Can you point me in the right direction? Doing this with MapReduce in Java seemed much easier to me.

Solution
This can be done quite easily with pivot and VectorAssembler. Replace the aggregation with pivot:
    pivoted = df.groupBy("id").pivot("category").count().na.fill(0)

and assemble:

    from pyspark.ml.feature import VectorAssembler

    # Compare against the string "id" (not the builtin id) so that the
    # id column is left out of the feature vector.
    input_cols = [x for x in pivoted.columns if x != "id"]

    result = (VectorAssembler(inputCols=input_cols, outputCol="features")
        .transform(pivoted)
        .select("id", "features"))

The result, with the four category columns (a, b, c, cc) as features, is shown below. VectorAssembler picks the more efficient representation (sparse or dense) for each row depending on its sparsity:

    +---+-----------------+
    |id |features         |
    +---+-----------------+
    |0  |(4,[0],[2.0])    |
    |5  |(4,[2],[1.0])    |
    |1  |[1.0,3.0,1.0,0.0]|
    |3  |(4,[0],[1.0])    |
    |2  |[1.0,3.0,1.0,1.0]|
    |4  |(4,[0],[1.0])    |
    +---+-----------------+

But of course you can still convert it to a single representation:

    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql.functions import col, udf
    import numpy as np

    def to_sparse(c):
        def to_sparse_(v):
            # Sparse vectors pass through unchanged.
            if isinstance(v, SparseVector):
                return v
            # Dense vectors: keep only the non-zero positions.
            vs = v.toArray()
            nonzero = np.nonzero(vs)[0]
            return SparseVector(v.size, nonzero, vs[nonzero])
        return udf(to_sparse_, VectorUDT())(c)

    result.select("id", to_sparse(col("features")).alias("features")).show(truncate=False)

    +---+-------------------------------+
    |id |features                       |
    +---+-------------------------------+
    |0  |(4,[0],[2.0])                  |
    |5  |(4,[2],[1.0])                  |
    |1  |(4,[0,1,2],[1.0,3.0,1.0])      |
    |3  |(4,[0],[1.0])                  |
    |2  |(4,[0,1,2,3],[1.0,3.0,1.0,1.0])|
    |4  |(4,[0],[1.0])                  |
    +---+-------------------------------+
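To make the layout of such sparse vectors concrete, here is a minimal plain-Python sketch (no Spark involved) that recomputes the per-id category counts from the question's input and stores them as (size, indices, values) triples. The helper name `sparse_counts` and the tuple layout are assumptions made for illustration only, and the sketch assumes the categories are indexed in alphabetical order.

```python
from collections import Counter

# Same (id, category) pairs as the question's DataFrame.
rows = [
    (0, "a"), (1, "a"), (1, "b"), (1, "c"),
    (2, "a"), (2, "b"), (2, "b"), (2, "b"), (2, "c"),
    (0, "a"), (1, "b"), (1, "b"), (2, "cc"),
    (3, "a"), (4, "a"), (5, "c"),
]


def sparse_counts(rows):
    """Hypothetical helper: map each id to a (size, indices, values) triple."""
    # Sorted distinct categories stand in for the pivoted columns (assumption).
    categories = sorted({category for _, category in rows})
    index = {category: i for i, category in enumerate(categories)}

    # Count every (id, category) pair, analogous to groupBy(...).count().
    counts = Counter(rows)

    vectors = {}
    # Iterating in sorted key order keeps each id's indices ascending.
    for (row_id, category), n in sorted(counts.items()):
        _, indices, values = vectors.setdefault(row_id, (len(categories), [], []))
        indices.append(index[category])
        values.append(float(n))
    return categories, vectors


categories, vectors = sparse_counts(rows)
print(categories)  # ['a', 'b', 'c', 'cc']
print(vectors[2])  # (4, [0, 1, 2, 3], [1.0, 3.0, 1.0, 1.0])
print(vectors[5])  # (4, [2], [1.0])
```

Only the non-zero counts are stored, so an id with a single category costs one index and one value no matter how many categories exist overall. This only illustrates the data layout; on a real cluster the pivot and VectorAssembler approach keeps the work distributed.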