scala – Multiclass classification example in Spark
Does anyone know where I can find examples of multiclass classification in Spark? I have spent a lot of time searching in books and on the web, and so far all I know is that it is possible according to the latest version of the documentation.
Solution
ML
(Recommended in Spark 2.0) We will use the same data as in the MLlib section below. There are two basic options. If an Estimator supports multiclass classification out of the box (for example random forest), you can use it directly:

val trainRawDf = trainRaw.toDF

import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier

val transformers = Array(
  new StringIndexer().setInputCol("group").setOutputCol("label"),
  new Tokenizer().setInputCol("text").setOutputCol("tokens"),
  new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)

model.transform(trainRawDf)

If the model supports only binary classification (like logistic regression) and extends o.a.s.ml.classification.Classifier, you can use the one-vs-rest strategy:

import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)

MLlib

According to the official documentation at this time (MLlib 1.6.0), the following methods support multiclass classification: logistic regression, decision trees, random forests, and naive Bayes. At least some of the examples use multiclass classification, e.g. the Naive Bayes example (3 classes).

Ignoring method-specific parameters, the general framework is pretty much the same as for all the other methods in MLlib. You have to pre-process your input to create either a data frame with columns representing the label and the features:

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

or an RDD[LabeledPoint]. Spark provides a broad range of useful tools designed to facilitate this process, including Feature Extractors, Feature Transformers, and Pipelines.

Below you'll find a rather naive example of using random forest.

First let's import the required packages and create dummy data:

import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

case class LabeledRecord(group: String, text: String)

val trainRaw = sc.parallelize(
  LabeledRecord("foo", "foo v a y b foo") ::
  LabeledRecord("bar", "x bar y bar v") ::
  LabeledRecord("bar", "x a y bar z") ::
  LabeledRecord("foobar", "foo v b bar z") ::
  LabeledRecord("foo", "foo x") ::
  LabeledRecord("foobar", "z y x foo a b bar v") ::
  Nil
)

Now let's define the required transformers and process the training dataset:

// Tokenizer to process text fields
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(10)

// Indexer to convert String labels to Double
val indexer = new StringIndexer()
  .setInputCol("group")
  .setOutputCol("label")
  .fit(trainRaw.toDF)

def transform(rdd: RDD[LabeledRecord]) = {
  val tokenized = tokenizer.transform(rdd.toDF)
  val hashed = hashingTF.transform(tokenized)
  val indexed = indexer.transform(hashed)
  indexed
    .select($"label", $"features")
    .map { case Row(label: Double, features: Vector) =>
      LabeledPoint(label, features)
    }
}

val train: RDD[LabeledPoint] = transform(trainRaw)

Note that the indexer is "fitted" on the training data. It simply means that the categorical values used as labels are converted to doubles. To use the classifier on new data, you have to transform it first using this indexer.
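As a side note, the fitted StringIndexerModel keeps the original string labels in its labels array, so numeric predictions can be mapped back to group names. A minimal sketch, assuming the indexer defined above; the labelToGroup helper name is mine, not part of the original example:

// Hypothetical helper: map a numeric prediction back to the original
// group name via the labels array of the fitted StringIndexerModel
def labelToGroup(prediction: Double): String = indexer.labels(prediction.toInt)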
Next we can train the RF model:

val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16

val model = RandomForest.trainClassifier(
  train, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins
)

Finally, let's test it:

val testRaw = sc.parallelize(
  LabeledRecord("foo", "foo foo z z z") ::
  LabeledRecord("bar", "z bar y y v") ::
  LabeledRecord("bar", "a a bar a z") ::
  LabeledRecord("foobar", "foo v b bar z") ::
  LabeledRecord("foobar", "a foo a bar") ::
  Nil
)

val test: RDD[LabeledPoint] = transform(testRaw)

val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)

metrics.precision
metrics.recall
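Beyond the overall precision and recall, MulticlassMetrics also exposes a confusion matrix and per-class metrics, which help show where the model confuses classes. A short sketch reusing the metrics object from above:

// Rows are actual classes, columns are predicted classes
println(metrics.confusionMatrix)

// Per-class precision and recall, keyed by the double-encoded labels
metrics.labels.foreach { l =>
  println(s"class $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
}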
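For the DataFrame-based pipelines from the ML section, the usual tool is MulticlassClassificationEvaluator rather than MulticlassMetrics. A minimal sketch, assuming the ovrModel and trainRawDf values from the ML section are still in scope (in practice you would evaluate on held-out data rather than the training set):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Score the data with the fitted one-vs-rest pipeline model
val predictions = ovrModel.transform(trainRawDf)

// "f1" is a metric name supported across Spark versions
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")

val f1 = evaluator.evaluate(predictions)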