scala – How to filter out null values from a Spark DataFrame
Published: 2020-12-16 09:29:35 | Category: Security | Source: compiled from the web
I created a DataFrame in Spark with the following schema:

    root
     |-- user_id: long (nullable = false)
     |-- event_id: long (nullable = false)
     |-- invited: integer (nullable = false)
     |-- day_diff: long (nullable = true)
     |-- interested: integer (nullable = false)
     |-- event_owner: long (nullable = false)
     |-- friend_id: long (nullable = false)

The data looks like this:

    +----------+----------+-------+--------+----------+-----------+---------+
    |   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
    +----------+----------+-------+--------+----------+-----------+---------+
    |   4236494| 110357109|      0|      -1|         0|  937597069|     null|
    |  78065188| 498404626|      0|       0|         0| 2904922087|     null|
    | 282487230|2520855981|      0|      28|         0| 3749735525|     null|
    | 335269852|1641491432|      0|       2|         0| 1490350911|     null|
    | 437050836|1238456614|      0|       2|         0|  991277599|     null|
    | 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
    | 516353916|1076364848|      0|       3|         1| 3597645735|     null|
    | 528218683|1151525474|      0|       1|         0| 3433080956|     null|
    | 531967718|3632072502|      0|       1|         0| 3863085861|     null|
    | 627948360|2823119321|      0|       0|         0| 4092665803|     null|
    | 811791433|3513954032|      0|       2|         0|  415464198|     null|
    | 830686203|  99027353|      0|       0|         0| 3549822604|     null|
    |1008893291|1115453150|      0|       2|         0| 2245155244|     null|
    |1239364869|2824096896|      0|       2|         1| 2579294650|     null|
    |1287950172|1076364848|      0|       0|         0| 3597645735|     null|
    |1345896548|2658555390|      0|       1|         0| 2025118823|     null|
    |1354205322|2564682277|      0|       3|         0| 2563033185|     null|
    |1408344828|1255629030|      0|      -1|         1|  804901063|     null|
    |1452633375|1334001859|      0|       4|         0| 1488588320|     null|
    |1625052108|3297535757|      0|       3|         0| 1972598895|     null|
    +----------+----------+-------+--------+----------+-----------+---------+

I want to filter for the rows that have a null value in the "friend_id" field.

    scala> val aaa = test.filter("friend_id is null")
    scala> aaa.count

I got res52: Long = 0, which is obviously wrong. What is the right way to do this?

One more question: I want to replace the values in the friend_id field. I want to replace null with 0, and any value other than null with 1. The code I could come up with is:

    val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)

This code does not work either. Can anyone tell me how to fix it? Thanks.

Solution
Suppose you have this data setup (so that the results are reproducible):

    // declaring data types
    case class Company(cName: String, cId: String, details: String)
    case class Employee(name: String, id: String, email: String, company: Company)

    // setting up example data (n1 and n5 have a null id)
    val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
    val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
    val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
    val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
    val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
    val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
    val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
    val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
    val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
    val df = sc.parallelize(employees).toDF

The data is:

    +----+----+---------+---------+
    |name|  id|    email|  company|
    +----+----+---------+---------+
    |  n1|null|n1@c1.com|[c1,1,d1]|
    |  n2|   2|n2@c1.com|[c1,1,d1]|
    |  n3|   3|n3@c1.com|[c1,1,d1]|
    |  n4|   4|n4@c2.com|[c2,2,d2]|
    |  n5|null|n5@c2.com|[c2,2,d2]|
    |  n6|   6|n6@c2.com|[c2,2,d2]|
    |  n7|   7|n7@c3.com|[c3,3,d3]|
    |  n8|   8|n8@c3.com|[c3,3,d3]|
    +----+----+---------+---------+

Now, to filter for the employees with a null id, you would do:

    df.filter("id is null").show

This correctly shows the following:

    +----+----+---------+---------+
    |name|  id|    email|  company|
    +----+----+---------+---------+
    |  n1|null|n1@c1.com|[c1,1,d1]|
    |  n5|null|n5@c2.com|[c2,2,d2]|
    +----+----+---------+---------+

Coming to the second part of your question, you can replace the null ids with 0 and all other values with 1 like this:

    // outside spark-shell this needs: import org.apache.spark.sql.functions.when
    df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

    +----+---+---------+---------+
    |name| id|    email|  company|
    +----+---+---------+---------+
    |  n1|  0|n1@c1.com|[c1,1,d1]|
    |  n2|  1|n2@c1.com|[c1,1,d1]|
    |  n3|  1|n3@c1.com|[c1,1,d1]|
    |  n4|  1|n4@c2.com|[c2,2,d2]|
    |  n5|  0|n5@c2.com|[c2,2,d2]|
    |  n6|  1|n6@c2.com|[c2,2,d2]|
    |  n7|  1|n7@c3.com|[c3,3,d3]|
    |  n8|  1|n8@c3.com|[c3,3,d3]|
    +----+---+---------+---------+

(Editor: 李大同)
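The answer selects the null rows with a SQL string. As a complement, here is a minimal sketch of the same operations written with the Column API, including the opposite case of dropping the null rows, which is what the title asks about. The object name `NullFilterSketch`, the helper names, and the trimmed three-row `Employee` class (without the nested `Company`) are assumptions made for illustration and do not come from the original answer. It also assumes a Spark 2.x or later build where `SparkSession` is available.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object NullFilterSketch {
  // Hypothetical, trimmed-down version of the Employee example (no nested Company).
  case class Employee(name: String, id: String, email: String)

  // Three rows: n1 has a null id, n2 and n3 do not.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Employee("n1", null, "n1@c1.com"),
      Employee("n2", "2", "n2@c1.com"),
      Employee("n3", "3", "n3@c1.com")
    ).toDF()
  }

  // Keep only the rows whose id is null (same result as filter("id is null")).
  def onlyNullIds(df: DataFrame): DataFrame = df.filter(col("id").isNull)

  // Drop the rows whose id is null; df.na.drop(Seq("id")) is an equivalent spelling.
  def withoutNullIds(df: DataFrame): DataFrame = df.filter(col("id").isNotNull)

  // Replace id with a 0/1 flag: 0 when the id is null, 1 otherwise.
  def flagIds(df: DataFrame): DataFrame =
    df.withColumn("id", when(col("id").isNull, 0).otherwise(1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-filter-sketch")
      .getOrCreate()
    val df = sample(spark)
    onlyNullIds(df).show()
    withoutNullIds(df).show()
    df.na.drop(Seq("id")).show()
    flagIds(df).show()
    spark.stop()
  }
}
```

The sketch uses `isNull` and `isNotNull` rather than a comparison such as `col("id") === null`, because a SQL comparison against NULL evaluates to NULL and such a filter matches no rows.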