Java / Scala中的内存olap / pivot表中是否有数据结构/库?
相关问题
这个问题非常相关,但是2岁:In memory OLAP engine in Java 背景 我想在内存中从给定的表格数据集创建一个像表格一样的数据透视表 例如按婚姻状况计算的年龄(行是年龄,列是婚姻状况). >输入:人员列表,包含年龄和一些布尔属性(例如已婚), 我尝试了什么(Scala) case class Person(val age:Int,val isMarried:Boolean) ... val people:List[Person] = ... // val peopleByAge = people.groupBy(_.age) //only by age val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status 我设法以天真的方式做到这一点,首先按年龄分组,然后根据婚姻状况进行计数,并输出结果,然后我foldRight汇总 TreeMap(peopleByAge.toSeq: _*).map(x => { val age = x._1 val rows = x._2 val numMarried = rows.count(_.isMarried()) val numNotMarried = rows.length - numMarried (age,numMarried,numNotMarried) }).foldRight(List[FinalResult]())(row,list) => { val cumMarried = row._2+ (if (list.isEmpty) 0 else list.last.cumMarried) val cumNotMarried = row._3 + (if (list.isEmpty) 0 else l.last.cumNotMarried) list :+ new FinalResult(row._1,row._2,row._3,cumMarried,cumNotMarried) }.reverse 我不喜欢上面的代码,它不高效,难以阅读,我确信有更好的方法. 问题 我怎么groupBy“两个”?以及如何对每个子组进行计数,例如
另一个问题是,我如何做一个总计,回答这个问题:
编辑: 谢谢你所有的好答案. 为了澄清,我希望输出包含一个带有以下列的“表” >年龄(升序) 不仅要回答这些特定的问题,还要制作一份能够回答所有此类问题的报告. 解决方法
这是一个更冗长的选项,但这是以通用方式而不是使用严格的数据类型.你当然可以使用泛型来使这个更好,但我认为你明白了.
/** Creates a new pivot structure by finding correlated values * and performing an operation on these values * * @param accuOp the accumulator function (e.g. sum,max,etc) * @param xCol the "x" axis column * @param yCol the "y" axis column * @param accuCol the column to collect and perform accuOp on * @return a new Pivot instance that has been transformed with the accuOp function */ def doPivot(accuOp: List[String] => String)(xCol: String,yCol: String,accuCol: String) = { // create list of indexes that correlate to x,y,accuCol val colsIdx = List(xCol,yCol,accuCol).map(headers.getOrElse(_,1)) // group by x and y,sending the resulting collection of // accumulated values to the accuOp function for post-processing val data = body.groupBy(row => { (row(colsIdx(0)),row(colsIdx(1))) }).map(g => { (g._1,accuOp(g._2.map(_(colsIdx(2))))) }).toMap // get distinct axis values val xAxis = data.map(g => {g._1._1}).toList.distinct val yAxis = data.map(g => {g._1._2}).toList.distinct // create result matrix val newRows = yAxis.map(y => { xAxis.map(x => { data.getOrElse((x,y),"") }) }) // collect it with axis labels for results Pivot(List((yCol + "/" + xCol) +: xAxis) ::: newRows.zip(yAxis).map(x=> {x._2 +: x._1})) } 我的Pivot类型非常基本: class Pivot(val rows: List[List[String]]) { val headers = rows.head.zipWithIndex.toMap val body = rows.tail ... } 要测试它,你可以这样做: val marriedP = Pivot( List( List("Name","Age","Married"),List("Bill","42","TRUE"),List("Heloise","47",List("Thelma","34","FALSE"),List("Bridget",List("Robert",List("Eddie","TRUE") ) ) def accum(values: List[String]) = { values.map(x => {1}).sum.toString } println(marriedP + "n") println(marriedP.doPivot(accum)("Age","Married","Married")) 产量: Name Age Married Bill 42 TRUE Heloise 47 TRUE Thelma 34 FALSE Bridget 47 TRUE Robert 42 FALSE Eddie 42 TRUE Married/Age 47 42 34 TRUE 2 2 FALSE 1 1 好处是你可以使用currying传递值的任何函数,就像在传统的excel数据透视表中一样. 更多信息可以在这里找到:https://github.com/vinsonizer/pivotfun (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |