Scala – How to split the probability column (a vector) obtained when fitting a GMM model to data into two columns
See the English answers here:
Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, …, fn: Double)] (2 answers)
I am trying to do the following:

+-----+-------------------------+----------+-------------------------------------------+
|label|features                 |prediction|probability                                |
+-----+-------------------------+----------+-------------------------------------------+
|0.0  |(3,[],[])                |0         |[0.9999999999999979,2.093996169658831E-15] |
|1.0  |(3,[0,1,2],[0.1,0.1,0.1])|0         |[0.999999999999999,9.891337521299582E-16]  |
|2.0  |(3,[0,1,2],[0.2,0.2,0.2])|0         |[0.9999999999999979,2.0939961696578572E-15]|
|3.0  |(3,[0,1,2],[9.0,9.0,9.0])|1         |[2.093996169659668E-15,0.9999999999999979] |
|4.0  |(3,[0,1,2],[9.1,9.1,9.1])|1         |[9.89133752128275E-16,0.999999999999999]   |
|5.0  |(3,[0,1,2],[9.2,9.2,9.2])|1         |[2.0939961696605603E-15,0.9999999999999979]|
+-----+-------------------------+----------+-------------------------------------------+

I want to split the probability column of the DataFrame above into two additional columns: prob1 and prob2. I found similar questions, one in PySpark and one in Scala; I don't know how to translate the PySpark code, and the Scala code gives me an error.

PySpark code:

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'),
                                    split2_udf('probability').alias('c2'))

or, to append the columns to the original DataFrame:

randomforestoutput.withColumn('c1', split1_udf('probability')) \
                  .withColumn('c2', split2_udf('probability'))

Scala code:

import org.apache.spark.sql.functions.udf

val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
model.transform(testDf).select(getPOne($"probability"))

Running the Scala code produces the following error:

scala> predictions.select(getPOne(col("probability"))).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)' due to
data type mismatch: argument 1 requires vector type, however, '`probability`' is
of vector type.;;
'Project [UDF(probability#39) AS UDF(probability)#135]
+- Project [label#0, features#1, prediction#34, UDF(features#1) AS probability#39]
   +- Project [label#0, features#1, UDF(features#1) AS prediction#34]
      +- Relation[label#0, features#1] libsvm

I am currently using Scala 2.11.11 and Spark 2.1.1.
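A note on that error before the solution: the message looks self-contradictory ("requires vector type, however ... is of vector type") because two different vector classes are involved. In Spark 2.x, spark.ml pipelines (including GaussianMixture) emit columns of org.apache.spark.ml.linalg.Vector, while the udf above is typed against the legacy org.apache.spark.mllib.linalg.Vector; both user-defined types print simply as "vector", which hides the mismatch. A minimal sketch of the corrected udf, reusing the question's predictions DataFrame:

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector  // the spark.ml class, not the old mllib one

// Same udf as in the question, but typed against the Vector class
// that spark.ml models actually put into the probability column.
val getPOne = udf((v: Vector) => v(1))
predictions.select(getPOne(col("probability"))).show(false)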
Solution

What I understand from your question is that you are trying to split the probability column into two columns, prob1 and prob2. If that is the case, simple array element access with withColumn should solve your problem.
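For context on why plain indexing is enough for arrays: $"probability"(0) is Column.apply, which performs the same element extraction as getItem(0) on an ArrayType column. A tiny sketch of that behaviour on a throwaway array column (the names here are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Column.apply on an ArrayType column extracts one element, just like getItem.
val df = Seq(Tuple1(Seq(0.1, 0.9))).toDF("probability")
df.select($"probability"(0).as("prob1"), $"probability".getItem(1).as("prob2")).show()

Applied to the question's DataFrame, the proposed fix looks like this: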
predictions
  .withColumn("prob1", $"probability"(0))
  .withColumn("prob2", $"probability"(1))
  .drop("probability")

You can find more functions for working with DataFrames in the org.apache.spark.sql.functions package.

EDIT

I created a temporary DataFrame matching your column:

val predictions = Seq(
  Array(1.0, 2.0),
  Array(2.0939961696605603E-15, 0.9999999999999979),
  Array(Double.NaN, Double.NaN)
).toDF("probability")

+-------------------------------------------+
|probability                                |
+-------------------------------------------+
|[1.0,2.0]                                  |
|[2.0939961696605603E-15,0.9999999999999979]|
|[NaN,NaN]                                  |
+-------------------------------------------+

and applying the solution above gives me

+----------------------+------------------+
|prob1                 |prob2             |
+----------------------+------------------+
|1.0                   |2.0               |
|2.0939961696605603E-15|0.9999999999999979|
|NaN                   |NaN               |
+----------------------+------------------+

EDIT on the schema mismatch

Since your probability column has Vector schema (VectorUDT) rather than the ArrayType schema assumed above, that solution does not apply to your case. Use the following instead: create udf functions that return the element values as expected.

import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vector  // ml, not mllib, to match the column's VectorUDT

val first = udf((v: Vector) => v.toArray(0))
val second = udf((v: Vector) => v.toArray(1))

predictions
  .withColumn("prob1", first($"probability"))
  .withColumn("prob2", second($"probability"))
  .drop("probability")

I hope this gets you the desired result.
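To tie everything together, here is a self-contained sketch of the final approach that can be pasted into spark-shell; the sample values come from the question, and the Tuple1 wrapper is only there to give the test DataFrame a Product encoder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{Vector, Vectors}

val spark = SparkSession.builder().master("local[*]").appName("split-probability").getOrCreate()
import spark.implicits._

// Stand-in for model.transform(testDf): a DataFrame whose probability
// column holds spark.ml Vectors (VectorUDT), as a fitted GMM produces.
val predictions = Seq(
  Tuple1(Vectors.dense(0.9999999999999979, 2.093996169658831E-15)),
  Tuple1(Vectors.dense(2.0939961696605603E-15, 0.9999999999999979))
).toDF("probability")

// udfs typed against org.apache.spark.ml.linalg.Vector to match the column.
val first  = udf((v: Vector) => v(0))
val second = udf((v: Vector) => v(1))

predictions
  .withColumn("prob1", first($"probability"))
  .withColumn("prob2", second($"probability"))
  .drop("probability")
  .show(false)

On Spark 3.0 and later, org.apache.spark.ml.functions.vector_to_array offers a built-in alternative: vector_to_array($"probability")(0) extracts the same element without a hand-written udf.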