scala – Spark 2 Dataset Null值异常

发布时间：2020-12-16 18:45:28 所属栏目：安全来源：网络整理

导读：在spark Dataset.filter中获取此null错误输入CSV： name,age,statabc,22,mxyz,s 工作代码： case class Person(name: String,age: Long,stat: String)val peopleDS = spark.read.option("inferSchema","true") .option("header","true").option("delimiter"

在spark Dataset.filter中获取此null错误

输入CSV：

name,age,stat
abc,22,m
xyz,s

工作代码：

case class Person(name: String,age: Long,stat: String)

val peopleDS = spark.read.option("inferSchema","true")
  .option("header","true").option("delimiter",",")
  .csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()

失败的代码(添加以下行返回错误)：

val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()

返回null错误

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long",name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class,or a Java bean,please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

解决方法

你得到的例外应该解释一切,但让我们一步一步走：

>使用csv数据源加载数据时,所有字段都标记为可为空：

val path: String = ???

val peopleDF = spark.read
  .option("inferSchema","true")
  .option("delimiter",")
  .csv(path)

peopleDF.printSchema

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)

>缺少字段表示为SQL NULL

peopleDF.where($"age".isNull).show

+----+----+----+
|name| age|stat|
+----+----+----+
| xyz|null|   s|
+----+----+----+

>接下来,将数据集[行]转换为使用Long编码年龄字段的数据集[Person]. Scala中的long不能为null.因为输入模式是可空的,所以输出模式保持可为空,尽管如此：

val peopleDS = peopleDF.as[Person]

peopleDS.printSchema

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- stat: string (nullable = true)

请注意,它作为[T]根本不会影响架构.
>使用SQL(在已注册表上)或DataFrame API查询数据集时,Spark不会反序列化对象.由于模式仍然可以为空,我们可以执行：

peopleDS.where($"age" > 30).show

+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+

没有任何问题.这只是一个简单的SQL逻辑,NULL是一个有效的值.
>当我们使用静态类型的数据集API时：

peopleDS.filter(_.age > 30)

Spark必须反序列化对象.因为Long不能为空(SQL NULL),所以它会失败并且您已经看到异常.

如果不是因为你得到了NPE.
>更正数据的静态类型表示应使用可选类型：

case class Person(name: String,age: Option[Long],stat: String)

调整过滤功能：

peopleDS.filter(_.age.map(_ > 30).getOrElse(false))

+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+

如果您愿意,可以使用模式匹配：

peopleDS.filter {
  case Some(age) => age > 30
  case _         => false     // or case None => false
}

请注意,您不必(但无论如何都会建议)使用名称和stat的可选类型.因为Scala String只是一个Java String,所以它可以为null.当然,如果你采用这种方法,你必须明确检查访问的值是否为空.