scala – Seq.contains in SQL in Spark Dataframe

Published: 2020-12-16 18:01:27 Category: Security Source: compiled from the web


I have the following data structure:

> id: int
> records: Seq[String]
> other: boolean

It is stored in a JSON file. For ease of testing:

// Triple-quoted strings avoid escaping the inner JSON quotes.
val data = sc.makeRDD(Seq[String](
   """{"id": 1, "records": ["one", "two", "three"], "other": true}""",
   """{"id": 2, "records": ["two"]}""",
   """{"id": 3, "records": ["one"], "other": false}"""))
sqlContext.jsonRDD(data).registerTempTable("temp")

I want to keep only the rows whose records field contains "one" and whose other field is true, using only SQL.

I can do this with an RDD filter (see below), but can it be done using only SQL?

sqlContext
    .sql("select id,records from temp where other = true")
    .rdd.filter(t => t.getAs[Seq[String]]("records").contains("one"))
    .collect()

Solution

Spark SQL supports the vast majority of Hive functions, so you can use array_contains to do the job:

spark.sql("select id, records from temp where other = true and array_contains(records, 'one')").show
// +---+-----------------+
// | id|          records|
// +---+-----------------+
// |  1|[one, two, three]|
// +---+-----------------+
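The same condition can also be written with the DataFrame API instead of a SQL string, since array_contains is exposed in org.apache.spark.sql.functions. The following is a minimal sketch under stated assumptions: the object name, the helper names sampleData and recordsContaining, the local master setting, and the `other` value for id 2 (not shown in the text above) are all illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

object ArrayContainsSketch {
  // Sample rows mirroring the question. The `other` value for id 2 is an assumption.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, Seq("one", "two", "three"), true),
      (2, Seq("two"), true),
      (3, Seq("one"), false)
    ).toDF("id", "records", "other")
  }

  // Keep rows where `other` is true and the `records` array contains `value`.
  def recordsContaining(df: DataFrame, value: String): DataFrame =
    df.filter(col("other") === true && array_contains(col("records"), value))
      .select("id", "records")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-contains-sketch")
      .getOrCreate()

    recordsContaining(sampleData(spark), "one").show()
    spark.stop()
  }
}
```

With this sample data only the row with id 1 satisfies both conditions: id 2 has no "one" in records, and id 3 has other set to false.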

Note: sqlContext.jsonRDD is deprecated in Spark 1.5; use the following code instead:

sqlContext.read.format("json").json(data).registerTempTable("temp")
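On Spark 2.2 and later, where the answer's spark.sql call implies a SparkSession, the whole flow can be sketched with spark.read.json on a Dataset[String] and createOrReplaceTempView (registerTempTable has been deprecated since Spark 2.0). This is a hedged sketch rather than the original author's code: the object and helper names and the local master setting are hypothetical, and the `other` value for id 2 is assumed because it is not shown in the text above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonTempViewSketch {
  // Load the sample JSON lines and register them as the temp view "temp".
  def registerSample(spark: SparkSession): Unit = {
    import spark.implicits._
    val data = Seq(
      """{"id": 1, "records": ["one", "two", "three"], "other": true}""",
      """{"id": 2, "records": ["two"], "other": true}""", // `other` assumed for id 2
      """{"id": 3, "records": ["one"], "other": false}"""
    ).toDS()
    spark.read.json(data).createOrReplaceTempView("temp")
  }

  // SQL-only filter: `other` is true and `records` contains 'one'.
  def query(spark: SparkSession): DataFrame =
    spark.sql(
      "select id, records from temp where other = true and array_contains(records, 'one')")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-temp-view-sketch")
      .getOrCreate()

    registerSample(spark)
    query(spark).show()
    spark.stop()
  }
}
```

Note that JSON schema inference reads id as a long, so code that collects the rows should use getLong rather than getInt.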

(Editor: 李大同)
