scala – Using addFile with pipe on a YARN cluster
I have been using pyspark with my YARN cluster with success. The work I'm doing involves using an RDD's pipe command to send data through a binary I've made. I can do this easily in pyspark like so (assuming 'sc' is already defined):

sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
t.pipe("dumb_prog")
t.take(10) # Gives expected result

However, if I do the same thing in Scala, the pipe command fails with a 'Cannot run program "dumb_prog": error=2, No such file or directory' error:

sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)

Why does this work only in Python and not in Scala? Is there a way I can get it to work?

Here is the full error message from the Scala side:

14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1 output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3 (take at <console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2, No such file or directory
    java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
    org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
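For context, the stack trace points at the mechanism involved: org.apache.spark.rdd.PipedRDD hands the command string to java.lang.ProcessBuilder on each executor. A minimal standalone sketch of that resolution behavior (hypothetical, not Spark code; assuming a Unix machine where dumb_prog is not on the PATH):

// What PipedRDD effectively does with the command string on an
// executor: a bare command name is looked up on the PATH (execvp
// semantics), not in the current working directory, so start()
// throws java.io.IOException: ... error=2, No such file or directory
// when the binary cannot be found there.
val pb = new java.lang.ProcessBuilder("dumb_prog")
val proc = pb.start()
// A command written as "./dumb_prog" would instead be resolved as an
// explicit path against the process's current working directory.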
Solution:

I ran into a similar issue with Spark 1.3.0 in YARN client mode. When I looked in the application cache directory, the file was never pushed to the executors, even when using --files. But when I changed the calls to the following, it did get pushed to each executor:
sc.addFile("dumb_prog",true) t.pipe("./dumb_prog") 我认为这是一个错误,但上面让我解决了这个问题. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |