scala – How to convert a Spark DataFrame into an RDD of mllib LabeledPoints
I am trying to apply PCA to my data and then run RandomForest on the transformed data. However, PCA.transform(data) gives me a DataFrame, but I need mllib LabeledPoints to feed my RandomForest. How can I do that?

My code:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc,"data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7,0.3))
val (trainingData,testData) = (splits(0),splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int,Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData,numClasses,categoricalFeaturesInfo,numTrees,featureSubsetStrategy,impurity,maxDepth,maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions, but neither works:

scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
  (dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
  (dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
  (dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)

and:

val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0),row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the example above.) Any help?

Solution
The correct approach here is the second one you tried: mapping each Row to a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
> The correct Vector class (org.apache.spark.mllib.linalg.Vector) does not take type arguments (e.g. Vector[Int]), so even though you have the right import, the compiler concludes that you mean scala.collection.immutable.Vector, which does.
> The DataFrame produced by PCA's transform has only three columns, so row(4) points at a column that does not exist. Its first rows look like this:

+-----+--------------------+--------------------+
|label|            features|         pcaFeatures|
+-----+--------------------+--------------------+
|  5.0|(780,[152,153,154...|[880.071111851977...|
|  1.0|(780,[158,159,160...|[-41.473039034112...|
|  2.0|(780,[155,156,157...|[931.444898405036...|
|  1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+

To make this easier, I prefer to use column names instead of their indices. So here is the transformation you need:

val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
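For completeness, here is a minimal end-to-end sketch of the same idea; it assumes the Spark 1.x APIs used in the question (where ml.feature.PCA still emits mllib vectors) and simply reuses the parameters defined above, so treat it as an illustration rather than a drop-in recipe:

// Sketch: convert the PCA output to RDD[LabeledPoint] and train the forest,
// reusing numClasses, numTrees, etc. from the question above.
import org.apache.spark.mllib.linalg.{Vector => MLLibVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Select columns by name rather than by position.
val pcaTrainingRdd = pca.transform(trainingDf).rdd.map { row =>
  LabeledPoint(row.getAs[Double]("label"), row.getAs[MLLibVector]("pcaFeatures"))
}

val model = RandomForest.trainClassifier(
  pcaTrainingRdd, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

The same mapping applies to the test split once it has been passed through pca.transform(testData.toDF()).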
