scala – Why does Spark ML NaiveBayes output labels that are different from the training data?
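I am using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that are different from the labels in my training set. Am I doing something wrong?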
Here is a small program that can be pasted into, for example, a Zeppelin notebook:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.sql.Row

    // Prepare training documents from a list of (id, text, label) tuples.
    val training = sqlContext.createDataFrame(Seq(
      (0L, "X totally sucks :-(", 100.0),
      (1L, "Today was kind of meh", 200.0),
      (2L, "I'm so happy :-)", 300.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val nb = new NaiveBayes()

    // All three stages must be listed; nb reads the "features" column produced by hashingTF.
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, nb))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = sqlContext.createDataFrame(Seq(
      (4L, "roller coasters are fun :-)"),
      (5L, "i burned my bacon :-("),
      (6L, "the movie is kind of meh")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prediction: Double) =>
        println(s"($id, $text) --> prediction=$prediction")
      }

Output of the small program:

    (4, roller coasters are fun :-)) --> prediction=2.0
    (5, i burned my bacon :-() --> prediction=0.0
    (6, the movie is kind of meh) --> prediction=1.0

The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.

Question: how do I map these predicted labels back to my original training set labels?

Bonus question: why do the training set labels have to be doubles, when any other type would work just as well as a label? It seems unnecessary.

Solution
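Sort of. As far as I can tell, you are hitting the issue described in SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, …), but ml.NaiveBayes has no validation step. Under the hood the data is passed to mllib.NaiveBayes, which does not have this restriction, so the training process runs without complaint.

When the model is converted back to ml, the prediction function simply assumes the labels were valid and returns the predicted label using …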
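Given the 0-based label requirement described above, one possible workaround is to index the labels before training and translate the predicted indices back afterwards. The following is a minimal sketch rather than a definitive fix. It assumes Spark 1.5 or later, a notebook-provided `sqlContext` as in the question, and the `StringIndexer` and `IndexToString` transformers from `org.apache.spark.ml.feature`. The column names `indexedLabel` and `predictedLabel` are arbitrary choices made for this illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IndexToString, StringIndexer, Tokenizer}

// Assumption: `sqlContext` is provided by the notebook (e.g. Zeppelin), as in the question.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Map the original labels to 0-based indices. Numeric input is cast to string,
// so the learned labels are "100.0", "200.0" and "300.0".
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)
val indexedTraining = labelIndexer.transform(training)

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Train on the indexed labels instead of the raw ones.
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")

// Translate predicted indices back to the original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb, labelConverter))

val model = pipeline.fit(indexedTraining)

// predictedLabel holds the original labels as strings.
val predictions = model.transform(test)
  .select("id", "text", "predictedLabel")
  .collect()
  .map(row => (row.getLong(0), row.getString(1), row.getString(2)))

predictions.foreach { case (id, text, label) =>
  println(s"($id, $text) --> predictedLabel=$label")
}
```

The label indexer is applied to the training data outside the pipeline so that the fitted pipeline never needs a `label` column on unlabeled test data. Because `predictedLabel` is a string column, call `.toDouble` on its values if numeric labels are needed downstream.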
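As for the bonus question: I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. It is also an efficient representation in terms of memory usage and computational cost.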
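To illustrate the point about LabeledPoint, here is a small sketch showing the same type carrying a class index in one case and a continuous target in the other. It assumes the spark-mllib classes are on the classpath; the variable names and feature values are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Classification: the Double label encodes a 0-based class index.
val classificationExample = LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))

// Regression: the same type carries a continuous target value.
val regressionExample = LabeledPoint(3.75, Vectors.sparse(3, Array(0, 2), Array(1.5, 0.5)))

println(classificationExample.label.toInt)
println(regressionExample.label)
```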