spark-reviews mailing list archives

From MaxGekk <>
Subject [GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Date Sat, 10 Nov 2018 09:07:12 GMT
Github user MaxGekk commented on a diff in the pull request:
    --- Diff: docs/ ---
    @@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide
       - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and
`FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`.
In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`,
especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}`
with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts
it to `Row(null)`.
    +  - In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced
`null`s if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0,
such input is treated as a bad record and handled according to specified mode. For example,
in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema
is `key STRING, value INT`. 
    --- End diff --
    When we use the data source, we can specify the schema only as a `StructType`. In that case,
we get either a `Seq[InternalRow]` or `Nil` from `JacksonParser`, which is `flatMap`ped, or a `BadRecordException`,
which is converted to an `Iterator[InternalRow]`. So there seems to be no way to get `null` rows.
The difference between the JSON datasource and the JSON functions is that the latter don't (and cannot) do
flattening: a function has to return exactly one value per input string. So the `Nil` case has to be
handled specially (this PR addresses that case).
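
    To make the distinction concrete, here is a rough, Spark-free sketch in plain Python. It is only a toy
model of the reasoning above, not the actual Spark code: `parse`, `BadRecord`, `read_datasource`,
`from_json_before` and `from_json_after` are hypothetical names, an empty list stands in for `Nil`, a tuple
stands in for a row with schema `key STRING, value INT`, and only `PERMISSIVE`-like handling is modelled.

```python
import json
from typing import List, Optional, Tuple

# Toy stand-in for a row with schema `key STRING, value INT` (assumption, not InternalRow).
Row = Tuple[Optional[str], Optional[int]]


class BadRecord(Exception):
    """Hypothetical stand-in for BadRecordException."""


def parse(record: str) -> List[Row]:
    """Toy parser: rows on success, [] (like Nil) if there is no root token, BadRecord otherwise."""
    if record.strip() == "":
        return []  # no valid root JSON token in the input
    try:
        obj = json.loads(record)
    except json.JSONDecodeError as exc:
        raise BadRecord(record) from exc
    if not isinstance(obj, dict):
        raise BadRecord(record)
    return [(obj.get("key"), obj.get("value"))]


def read_datasource(records: List[str]) -> List[Row]:
    """Datasource-like path: results are flattened, so an empty result just disappears."""
    out: List[Row] = []
    for record in records:
        try:
            out.extend(parse(record))  # flatMap: [] contributes no rows
        except BadRecord:
            out.append((None, None))   # PERMISSIVE-like: bad record becomes an all-null row
    return out


def from_json_before(record: str) -> Optional[Row]:
    """Function-like path without special handling: one output per input, so [] leaks out as None."""
    try:
        rows = parse(record)
    except BadRecord:
        return (None, None)
    return rows[0] if rows else None


def from_json_after(record: str) -> Row:
    """Function-like path with the empty result treated as a bad record."""
    try:
        rows = parse(record)
        if not rows:
            raise BadRecord(record)
    except BadRecord:
        return (None, None)
    return rows[0]
```

    In this model `read_datasource` can never emit a `None` row, while `from_json_before(" ")` returns `None`
and `from_json_after(" ")` returns `(None, None)`, which mirrors the `Row(null, null)` result described in the
diff above.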

