scala – 当按键分组时,Spark耗尽内存
发布时间:2020-12-16 19:06:44 所属栏目:安全 来源:网络整理
导读:我试图使用 this guide在EC2上使用Spark主机执行常见的爬网数据的简单转换,我的代码如下所示: package ccminerimport org.apache.spark.SparkContextimport org.apache.spark.SparkContext._object ccminer { val english = "english|en|eng" val spanish =
我试图使用
this guide在EC2上使用Spark主机执行常见的爬网数据的简单转换,我的代码如下所示:
package ccminer import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object ccminer { val english = "english|en|eng" val spanish = "es|esp|spa|spanish|espanol" val turkish = "turkish|tr|tur|turc" val greek = "greek|el|ell" val italian = "italian|it|ita|italien" val all = (english :: spanish :: turkish :: greek :: italian :: Nil).mkString("|") def langIndep(s: String) = s.toLowerCase().replaceAll(all,"*") def main(args: Array[String]): Unit = { if (args.length != 3) { System.err.println("Bad command line") System.exit(-1) } val cluster = "spark://???" val sc = new SparkContext(cluster,"Common Crawl Miner",System.getenv("SPARK_HOME"),Seq("/root/spark/ccminer/target/scala-2.10/cc-miner_2.10-1.0.jar")) sc.sequenceFile[String,String](args(0)).map { case (k,v) => (langIndep(k),v) } .groupByKey(args(2).toInt) .filter { case (_,vs) => vs.size > 1 } .saveAsTextFile(args(1)) } } 我正在运行它的命令如下: sbt/sbt "run-main ccminer.ccminer s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690165636/textData-* s3n://parallelcorpus/out/ 2000" 但是很快就失败了,错误如下 java.lang.OutOfMemoryError: Java heap space at com.ning.compress.BufferRecycler.allocEncodingBuffer(BufferRecycler.java:59) at com.ning.compress.lzf.ChunkEncoder.<init>(ChunkEncoder.java:93) at com.ning.compress.lzf.impl.UnsafeChunkEncoder.<init>(UnsafeChunkEncoder.java:40) at com.ning.compress.lzf.impl.UnsafeChunkEncoderLE.<init>(UnsafeChunkEncoderLE.java:13) at com.ning.compress.lzf.impl.UnsafeChunkEncoders.createEncoder(UnsafeChunkEncoders.java:31) at com.ning.compress.lzf.util.ChunkEncoderFactory.optimalInstance(ChunkEncoderFactory.java:44) at com.ning.compress.lzf.LZFOutputStream.<init>(LZFOutputStream.java:61) at org.apache.spark.io.LZFCompressionCodec.compressedOutputStream(CompressionCodec.scala:60) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:803) at org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471) at org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 所以我的基本问题是,需要写一个Spark任务,可以通过键分组几乎无限量的输入,而不会耗尽内存? 解决方法
随机播放任务中java.lang.OutOfMemoryError异常的最常见原因(如groupByKey,reduceByKey等)为
parallelism的低级别.
您可以通过在configuration中设置spark.default.parallelism属性来增加默认值. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |
相关内容
- Webservice_23_SOAP的基于契约优先开发用户管理_实现客户端
- 在bourne shell中使用user-defined-function中的getopts
- 不能使用bash作为Sensu插件获取jstat指标
- angularjs – Angular Directive ng-click Not Working
- 如何在SCALA中的Row RDD中访问elemens
- Heroku负载均衡器与Netflix zuul
- angularjs – Angular UI-Router模式删除父状态
- bootstrap标准模板
- symfony angularjs登录后InsufficientAuthenticationExcept
- box-sizing Bootstrap