算法 – scala中位数实现
scala中的快速实施是什么?
这是我在rosetta code发现的: def median(s: Seq[Double]) = { val (lower,upper) = s.sortWith(_<_).splitAt(s.size / 2) if (s.size % 2 == 0) (lower.last + upper.head) / 2.0 else upper.head } 我不喜欢它,因为它是一种排序。我知道有一些计算线性时间中值的方法。 编辑: 我想要一套中间函数,我可以在各种情况下使用: >快速,中位数计算可以在线性时间内完成 请只发布编译和正确计算中位数的代码。为了简单起见,您可以假设所有输入都包含奇数个值。 解决方法
不可变算法
first algorithm indicated Taylor Leese是二次的,但是具有线性平均。然而,这取决于枢轴选择。所以我在这里提供一个版本,它具有可插拔的枢轴选择,随机枢纽和中位数的中位数(保证线性时间)。 import scala.annotation.tailrec @tailrec def findKMedian(arr: Array[Double],k: Int)(implicit choosePivot: Array[Double] => Double): Double = { val a = choosePivot(arr) val (s,b) = arr partition (a >) if (s.size == k) a // The following test is used to avoid infinite repetition else if (s.isEmpty) { val (s,b) = arr partition (a ==) if (s.size > k) a else findKMedian(b,k - s.size) } else if (s.size < k) findKMedian(b,k - s.size) else findKMedian(s,k) } def findMedian(arr: Array[Double])(implicit choosePivot: Array[Double] => Double) = findKMedian(arr,(arr.size - 1) / 2) 随机数轴(二次,线性平均),不变 这是随机的选择。随机因子算法的分析比正常情况更为棘手,因为它主要涉及概率和统计学。 def chooseRandomPivot(arr: Array[Double]): Double = arr(scala.util.Random.nextInt(arr.size)) 中位数(线性),不变 中值方法的中位数,保证与上述算法一起使用的线性时间。首先,算法计算最多5个数字的中值,这是中值算法的中值的基础。这是Rex Kerr Rex Kerr提供的 – 算法取决于它的速度。 def medianUpTo5(five: Array[Double]): Double = { def order2(a: Array[Double],i: Int,j: Int) = { if (a(i)>a(j)) { val t = a(i); a(i) = a(j); a(j) = t } } def pairs(a: Array[Double],j: Int,k: Int,l: Int) = { if (a(i)<a(k)) { order2(a,j,k); a(j) } else { order2(a,i,l); a(i) } } if (five.length < 2) return five(0) order2(five,1) if (five.length < 4) return ( if (five.length==2 || five(2) < five(0)) five(0) else if (five(2) > five(1)) five(1) else five(2) ) order2(five,2,3) if (five.length < 5) pairs(five,1,3) else if (five(0) < five(2)) { order2(five,4); pairs(five,4,3) } else { order2(five,3,4) } } 然后,中值算法本身的中位数。基本上,它保证选择枢轴将大于列表的其他30%的至少30%并且小于其他30%,这足以保证先前算法的线性。查看另一个答案中提供的维基百科链接的详细信息。 def medianOfMedians(arr: Array[Double]): Double = { val medians = arr grouped 5 map medianUpTo5 toArray; if (medians.size <= 5) medianUpTo5 (medians) else medianOfMedians(medians) } 就地算法 所以,这里是一个就地版本的算法。我正在使用一个实现一个分区的类,并使用一个后备数组,这样算法的变化是最小的。 case class ArrayView(arr: Array[Double],from: Int,until: Int) { def apply(n: Int) = if (from + n < until) arr(from + n) else throw new ArrayIndexOutOfBoundsException(n) def partitionInPlace(p: Double => Boolean): (ArrayView,ArrayView) = { var upper = until - 1 var lower = from while (lower < upper) { while (lower < until && p(arr(lower))) lower += 1 while (upper >= from && !p(arr(upper))) upper -= 1 if (lower < upper) { val tmp = arr(lower); arr(lower) = arr(upper); arr(upper) = tmp } } (copy(until = lower),copy(from = lower)) } def size = until - from def isEmpty = size <= 0 override def toString = arr mkString ("ArraySize(",",")") }; object ArrayView { def apply(arr: Array[Double]) = new ArrayView(arr,arr.size) } @tailrec def findKMedianInPlace(arr: ArrayView,k: Int)(implicit choosePivot: ArrayView => Double): Double = { val a = choosePivot(arr) val (s,b) = arr partitionInPlace (a >) if (s.size == k) a // The following test is used to avoid infinite repetition else if (s.isEmpty) { val (s,b) = arr partitionInPlace (a ==) if (s.size > k) a else findKMedianInPlace(b,k - s.size) } else if (s.size < k) findKMedianInPlace(b,k - s.size) else findKMedianInPlace(s,k) } def findMedianInPlace(arr: Array[Double])(implicit choosePivot: ArrayView => Double) = findKMedianInPlace(ArrayView(arr),(arr.size - 1) / 2) 随机数据透视,就地 由于中位数的中位数要比我定义的ArrayView类提供的更多的支持需要更多的支持,所以我只是为了实现就地算法来实现radom枢纽。 def chooseRandomPivotInPlace(arr: ArrayView): Double = arr(scala.util.Random.nextInt(arr.size)) 直方图算法(O(log(n))),不变 所以,关于流。对于只能遍历一次的流,不能对O(n)内存进行任何操作,除非你碰巧知道字符串长度(在这种情况下,它不再是我的书中的流)。 使用桶也是有问题的,但是如果我们可以遍历它多次,那么我们可以知道它的大小,最大和最小,并从那里工作。例如: def findMedianHistogram(s: Traversable[Double]) = { def medianHistogram(s: Traversable[Double],discarded: Int,medianIndex: Int): Double = { // The buckets def numberOfBuckets = (math.log(s.size).toInt + 1) max 2 val buckets = new Array[Int](numberOfBuckets) // The upper limit of each bucket val max = s.max val min = s.min val increment = (max - min) / numberOfBuckets val indices = (-numberOfBuckets + 1 to 0) map (max + increment * _) // Return the bucket a number is supposed to be in def bucketIndex(d: Double) = indices indexWhere (d <=) // Compute how many in each bucket s foreach { d => buckets(bucketIndex(d)) += 1 } // Now make the buckets cumulative val partialTotals = buckets.scanLeft(discarded)(_+_).drop(1) // The bucket where our target is at val medianBucket = partialTotals indexWhere (medianIndex <) // Keep track of how many numbers there are that are less // than the median bucket val newDiscarded = if (medianBucket == 0) discarded else partialTotals(medianBucket - 1) // Test whether a number is in the median bucket def insideMedianBucket(d: Double) = bucketIndex(d) == medianBucket // Get a view of the target bucket val view = s.view filter insideMedianBucket // If all numbers in the bucket are equal,return that if (view forall (view.head ==)) view.head // Otherwise,recurse on that bucket else medianHistogram(view,newDiscarded,medianIndex) } medianHistogram(s,(s.size - 1) / 2) } 测试和基准测试 为了测试算法,我使用Scalacheck,并将每个算法的输出与通过排序的平凡实现的输出进行比较。这当然是排序版本是正确的。 我使用所有提供的枢轴选择,以及固定的枢轴选择(阵列的中间,向下))对上述每个算法进行基准测试。每个算法都使用三种不同的输入阵列大小进行测试,并针对每种算法进行三次测试。 以下是测试代码: import org.scalacheck.{Prop,Pretty,Test} import Prop._ import Pretty._ def test(algorithm: Array[Double] => Double,reference: Array[Double] => Double): String = { def prettyPrintArray(arr: Array[Double]) = arr mkString ("Array(",")") val resultEqualsReference = forAll { (arr: Array[Double]) => arr.nonEmpty ==> (algorithm(arr) == reference(arr)) :| prettyPrintArray(arr) } Test.check(Test.Params(),resultEqualsReference)(Pretty.Params(verbosity = 0)) } import java.lang.System.currentTimeMillis def bench[A](n: Int)(body: => A): Long = { val start = currentTimeMillis() 1 to n foreach { _ => body } currentTimeMillis() - start } import scala.util.Random.nextDouble def benchmark(algorithm: Array[Double] => Double,arraySizes: List[Int]): List[Iterable[Long]] = for (size <- arraySizes) yield for (iteration <- 1 to 3) yield bench(50000)(algorithm(Array.fill(size)(nextDouble))) def testAndBenchmark: String = { val immutablePivotSelection: List[(String,Array[Double] => Double)] = List( "Random Pivot" -> chooseRandomPivot,"Median of Medians" -> medianOfMedians,"Midpoint" -> ((arr: Array[Double]) => arr((arr.size - 1) / 2)) ) val inPlacePivotSelection: List[(String,ArrayView => Double)] = List( "Random Pivot (in-place)" -> chooseRandomPivotInPlace,"Midpoint (in-place)" -> ((arr: ArrayView) => arr((arr.size - 1) / 2)) ) val immutableAlgorithms = for ((name,pivotSelection) <- immutablePivotSelection) yield name -> (findMedian(_: Array[Double])(pivotSelection)) val inPlaceAlgorithms = for ((name,pivotSelection) <- inPlacePivotSelection) yield name -> (findMedianInPlace(_: Array[Double])(pivotSelection)) val histogramAlgorithm = "Histogram" -> ((arr: Array[Double]) => findMedianHistogram(arr)) val sortingAlgorithm = "Sorting" -> ((arr: Array[Double]) => arr.sorted.apply((arr.size - 1) / 2)) val algorithms = sortingAlgorithm :: histogramAlgorithm :: immutableAlgorithms ::: inPlaceAlgorithms val formattingString = "%%-%ds %%s" format (algorithms map (_._1.length) max) // Tests val testResults = for ((name,algorithm) <- algorithms) yield formattingString format (name,test(algorithm,sortingAlgorithm._2)) // Benchmarks val arraySizes = List(100,500,1000) def formatResults(results: List[Long]) = results map ("%8d" format _) mkString val benchmarkResults: List[String] = for { (name,algorithm) <- algorithms results <- benchmark(algorithm,arraySizes).transpose } yield formattingString format (name,formatResults(results)) val header = formattingString format ("Algorithm",formatResults(arraySizes.map(_.toLong))) "Tests" :: "*****" :: testResults ::: ("" :: "Benchmark" :: "*********" :: header :: benchmarkResults) mkString ("","n","n") } 结果 测试: Tests ***** Sorting OK,passed 100 tests. Histogram OK,passed 100 tests. Random Pivot OK,passed 100 tests. Median of Medians OK,passed 100 tests. Midpoint OK,passed 100 tests. Random Pivot (in-place)OK,passed 100 tests. Midpoint (in-place) OK,passed 100 tests. 基准: Benchmark ********* Algorithm 100 500 1000 Sorting 1038 6230 14034 Sorting 1037 6223 13777 Sorting 1039 6220 13785 Histogram 2918 11065 21590 Histogram 2596 11046 21486 Histogram 2592 11044 21606 Random Pivot 904 4330 8622 Random Pivot 902 4323 8815 Random Pivot 896 4348 8767 Median of Medians 3591 16857 33307 Median of Medians 3530 16872 33321 Median of Medians 3517 16793 33358 Midpoint 1003 4672 9236 Midpoint 1010 4755 9157 Midpoint 1017 4663 9166 Random Pivot (in-place) 392 1746 3430 Random Pivot (in-place) 386 1747 3424 Random Pivot (in-place) 386 1751 3431 Midpoint (in-place) 378 1735 3405 Midpoint (in-place) 377 1740 3408 Midpoint (in-place) 375 1736 3408 分析 所有算法(排序版本除外)具有与平均线性时间复杂度相兼容的结果。 在最坏情况下保证线性时间复杂度的中位数中位数比随机枢轴慢得多。 固定枢轴选择比随机枢轴稍差,但在非随机输入上可能会有更差的性能。 就地版的速度快了大约230%?250%,但进一步的测试(未显示)似乎表明这个优势随着阵列的大小而增长。 我非常惊讶于直方图算法。它显示线性时间复杂度平均值,也比中位数的中位数快33%。但是,输入是随机的。最糟糕的情况是二次方程式 – 在调试代码时,我看到了一些例子。 (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |