scala – How do I extract a specific element from an array column for each row?

Published: 2020-12-16 08:46:15 | Category: Security | Source: compiled from the web

I have the following DataFrame in Spark 2.2.0 and Scala 2.11.8.

+----------+-------------------------------+
|item      |        other_items            |
+----------+-------------------------------+
|  111     |[[444,1.0],[333,0.5],[666,0.4]]|
|  222     |[[444,0.5]]                    |
|  333     |[]                             |
|  444     |[[111,2.0],[555,[777,0.2]]     |
+----------+-------------------------------+

I would like to obtain the following DataFrame:

+----------+-------------+
|item      | other_items |
+----------+-------------+
|  111     | 444         |
|  222     | 444         |
|  444     | 111         |
+----------+-------------+

So, basically, I need to extract the first item from other_items for each row. I also need to ignore rows that have an empty array [] in other_items.

How can I do this?

I tried the following approach, but it does not give me the expected result.

val result = df.withColumn("other_items", $"other_items"(0))

printSchema gives the following output:

root
 |-- item: string (nullable = true)
 |-- other_items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = true)

Solution

Like this:

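import spark.implicits._  // needed for toDF and $"..."; assumes a SparkSession named spark

val df = Seq(
  ("111", Seq(("111", 1.0), ("333", 0.5), ("666", 0.4))),
  // The item id of this empty-array row is garbled in the source; "333" is assumed from the question.
  ("333", Seq.empty[(String, Double)])
).toDF("item", "other_items")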


df.select($"item",$"other_items"(0)("_1").alias("other_items"))
  .na.drop(Seq("other_items")).show

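The first apply ($"other_items"(0)) selects the first element of the array, the second apply (_("_1")) selects the _1 field, and na.drop removes the nulls introduced by empty arrays.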

+----+-----------+
|item|other_items|
+----+-----------+
| 111|        111|
+----+-----------+
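
As a variation on the same idea, here is a minimal self-contained sketch that removes empty arrays explicitly with size before extracting the field, instead of relying on na.drop afterwards. The object name FirstOtherItem, the helper firstOtherItem, the local SparkSession settings, and the sample rows are assumptions made for illustration (the row for item 444 is left out because its value is garbled in the question). getItem, getField, and size are standard Spark SQL column functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size}

object FirstOtherItem {
  // Keeps only rows whose array has at least one element, then replaces
  // the array column with the _1 field of its first struct.
  def firstOtherItem(df: DataFrame): DataFrame =
    df.filter(size(col("other_items")) > 0)
      .withColumn("other_items", col("other_items").getItem(0).getField("_1"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (hypothetical settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-other-item")
      .getOrCreate()
    import spark.implicits._

    // Sample rows modelled on the question's table.
    val df = Seq(
      ("111", Seq(("444", 1.0), ("333", 0.5), ("666", 0.4))),
      ("222", Seq(("444", 0.5))),
      ("333", Seq.empty[(String, Double)])
    ).toDF("item", "other_items")

    firstOtherItem(df).show()
    spark.stop()
  }
}
```

Filtering on size first keeps the intent visible: a row is dropped because its array is empty, whereas na.drop would also drop a row whose first struct happens to have a null _1.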

(Editor: 李大同)
