scala – How to use S3 with Apache Spark 2.2 in the Spark shell
I am trying to load data from an Amazon AWS S3 bucket from within the Spark shell.
I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera Custom s3 endpoints

I downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I have replaced the access-key and secret-key):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key

I downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository and placed them in the jars/ directory. Then I started the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar

In the shell, this is how I tried to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

Here is the resulting error:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead tried to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

I got two errors: one when the shell started, and another when I tried to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Can someone suggest how to get this working? Thank you.

Solution
If you are using Apache Spark 2.2.0, you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar, which match the Hadoop 2.7 client libraries bundled with the Spark 2.2.0 binary distribution.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar

After that, when you try to load data from the S3 bucket within the shell, it will work.
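As an alternative to putting the keys in conf/spark-defaults.conf, the same S3A settings can be applied at runtime inside the shell. This is a sketch, not a verified session: it assumes the matching hadoop-aws-2.7.3 / aws-java-sdk-1.7.4 jars are already on the classpath as shown above, the credential values are placeholders, and the bucket path is the one from the question.

```scala
// Run inside spark-shell (Spark 2.2.x), where `spark` (SparkSession) and
// `sc` (SparkContext) are predefined.
// Placeholder credentials: replace with your real access and secret keys.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")

// textFile returns a Dataset[String] in Spark 2.x, one element per line.
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)
```

Setting the keys this way keeps credentials out of files on disk, at the cost of having to re-enter them each session; either approach feeds the same fs.s3a.* Hadoop configuration that the S3A filesystem reads.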