
scala - How to filter out null values from a Spark DataFrame

Published: 2020-12-16 09:29:35 | Category: Security | Source: compiled from the web
I created a DataFrame in Spark with the following schema:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)

The data looks like this:

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

I want to filter out the rows that have a null value in the "friend_id" field.

scala> val aaa = test.filter("friend_id is null")

scala> aaa.count

I got res52: Long = 0, which is obviously not right. What is the correct way to do this?

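One more question: I want to replace the values in the friend_id field. Null should become 0, and any value other than null should become 1. The code I could come up with is: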

val aaa = train_friend_join.select($"user_id",$"event_id",$"invited",$"day_diff",$"interested",$"event_owner",($"friend_id" != null)?1:0)

This code does not work either. Can anyone tell me how to fix it? Thanks.

Solution

Suppose you have this data setup (so that the results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data (n1 and n5 have a null id)
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
// in spark-shell, `sc` and the implicits needed by toDF are already in scope
val df = sc.parallelize(employees).toDF

The data is:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+
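The setup above relies on `sc` and on the implicit `toDF` conversion, both of which spark-shell provides automatically. Outside the shell, a minimal sketch might look like the following. It assumes Spark 2.x or later running with a local master; the app name, the `local[*]` setting, and the shortened three-row data set are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Same case classes as in the answer
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// Assumption: a local session; in spark-shell a `spark` value already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-filter-setup") // hypothetical name
  .getOrCreate()
import spark.implicits._ // brings toDF and the $"col" syntax into scope

// Shortened data set: one employee with a null id, two with an id
val df = Seq(
  Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1")),
  Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1")),
  Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
).toDF()

// String fields of a case class are inferred as nullable columns
df.printSchema()
```

Checking `printSchema` first is a cheap way to confirm that the column you intend to filter is actually declared nullable.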

Now, to filter the employees with a null id, you would do:

df.filter("id is null").show

This correctly shows the following:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+
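The same filter can also be written with the Column API, and the inverse condition removes the null rows instead of selecting them. The sketch below is one way to do it, assuming a local SparkSession; the three-row `people` DataFrame and the app name are made up for illustration and are not from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local run
  .appName("null-filter-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Tiny illustrative data set: two null ids, one non-null id
val people = Seq[(String, String)](("n1", null), ("n2", "2"), ("n5", null))
  .toDF("name", "id")

val nullIdsSql = people.filter("id is null")        // SQL expression string, as in the answer
val nullIdsCol = people.filter($"id".isNull)        // equivalent Column API form
val nonNullIds = people.filter(col("id").isNotNull) // keeps only rows that have an id
val dropped    = people.na.drop(Seq("id"))          // also drops rows whose id is null

nullIdsCol.show()
nonNullIds.show()
```

Comparing a column to null with an equality operator is not a substitute for `isNull`: under SQL semantics such a comparison evaluates to null, so the filter keeps no rows.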

Coming to the second part of your question, you can replace null ids with 0 and all other values with 1 like this:

import org.apache.spark.sql.functions.when // already available in some shells, needed elsewhere
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+
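Applied to the question's own column, the expression `($"friend_id" != null)?1:0` fails for two reasons: Scala has no `? :` operator, and `!=` on a Column is ordinary Scala inequality that yields a plain Boolean rather than a Column expression. A minimal sketch of the `when`/`otherwise` approach on a `friend_id` column follows. It assumes a local SparkSession; the two rows are illustrative, and the non-null `friend_id` value 42 is invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder()
  .master("local[*]")          // assumption: local run
  .appName("null-flag-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Two illustrative rows; Option[Long] becomes a nullable long column
val events = Seq[(Long, Option[Long])](
  (4236494L, None),
  (78065188L, Some(42L)) // 42 is a made-up friend_id
).toDF("user_id", "friend_id")

// null -> 0, anything else -> 1, as in the answer
val flagged = events.withColumn("friend_id", when($"friend_id".isNull, 0).otherwise(1))

// Alternative: cast the boolean null check to an integer (true -> 1, false -> 0)
val viaCast = events.withColumn("friend_id", $"friend_id".isNotNull.cast("int"))

flagged.show()
viaCast.show()
```

Using `withColumn` rather than listing every column in `select` keeps the remaining columns untouched, which is convenient for a wide table like the one in the question.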

(Editor: 李大同)
