drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4824) Null maps / lists and non-provided state support for JSON fields. Numeric types promotion.
Date Sun, 09 Jul 2017 00:21:01 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079386#comment-16079386 ]

Paul Rogers commented on DRILL-4824:
------------------------------------

Turns out there is a simple (though inefficient) solution to the null/is-set issue: just add
another "bits" vector.

The existing "bits" vector indicates if the value is set (really, is set and is not null.) Add
another vector that identifies whether the value is set (to null) or is unset. This alternative
can be backward compatible, but as a result, the semantics are rather convoluted.

The "bits" vector remains 1 if the (non-null) value is set, 0 if the value is null (which,
in current Drill, is the same as not set.)

The new "bits2" vector is 1 if the value is JSON-unset, 0 if the value is explicitly null.

Here "JSON-unset" means that JSON-aware operators should consider the value to be unset. All
JSON-unaware operators just look at the existing "bits" vector for the combined unset/null
state.

In short:

|| State || bits value || bits2 value ||
| Set to non-NULL value | 1 | N/A |
| Explicitly null | 0 | 0 |
| Explicitly unset | 0 | 1 |
| Drill NULL | 0 | N/A |

Here, "Drill null" means the existing Drill meaning of NULL: unset or explicitly null.

Given this definition, existing code (including the JDBC drivers) can just ignore "bits2"
and work fine. Only the JSON reader, and the JSON writer, will know how to interpret "bits2."
With the definition above, a missing "bits2" can be interpreted as if "bits2" were present,
but filled with zeros.
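
To make the encoding concrete, here is a minimal Java sketch of how a JSON-aware reader or writer
might decode the state of one row. It assumes, per the description above, that "bits" is 0 for
both the explicit-null and unset cases and that "bits2" only matters when "bits" is 0. The class,
enum, and method names are made up for illustration and are not Drill's vector API; plain byte
arrays stand in for the vectors.

```java
// Sketch only: the class, enum, and method names are hypothetical, not Drill's vector API.
final class NullStateSketch {
    enum State { VALUE, EXPLICIT_NULL, UNSET }

    // bits:  one byte per row; 1 = non-null value present, 0 = Drill NULL.
    // bits2: one byte per row; consulted only when bits is 0. May be null (vector absent).
    static State decode(byte[] bits, byte[] bits2, int row) {
        if (bits[row] == 1) {
            return State.VALUE;
        }
        if (bits2 == null) {
            // A missing bits2 is read as if it were present and filled with zeros.
            return State.EXPLICIT_NULL;
        }
        return bits2[row] == 1 ? State.UNSET : State.EXPLICIT_NULL;
    }

    // What a JSON-unaware operator sees: the combined unset/null state.
    static boolean isDrillNull(byte[] bits, int row) {
        return bits[row] == 0;
    }

    public static void main(String[] args) {
        byte[] bits  = {1, 0, 0};
        byte[] bits2 = {0, 0, 1};
        for (int row = 0; row < bits.length; row++) {
            System.out.println(row + ": " + decode(bits, bits2, row)
                + ", drillNull=" + isDrillNull(bits, row));
        }
    }
}
```

Code that never looks at "bits2" calls only the equivalent of isDrillNull() and behaves as it
does today.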

The above is, admittedly, a hack. (Recall that the "bits" vector is not really bits: each entry
is actually a byte, so we'd now be using 16 bits to encode three states, which is a huge waste.)

We'd still want to move to the full solution explained earlier. To do that, we'd want to ensure
that all accesses to "bits" and "bits2" occur through methods on the vector classes. Once this
is done, we can swap out implementations for the more compact, single-vector version. (We'd
also need a solution for older JDBC drivers already deployed in the field: this gets us back
to the client version number issue...)
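
As a rough illustration of hiding "bits" and "bits2" behind methods, the Java sketch below puts
both layouts behind one interface so that callers cannot tell which one is in use. Everything
here is hypothetical: the interface and class names are invented, and the single-vector encoding
(one state code per row) is only an assumed stand-in, since the full solution from the earlier
comment is not reproduced in this message.

```java
// Sketch only: hypothetical interface and classes, not part of Drill.
interface NullabilityStore {
    boolean isDrillNull(int row);  // combined null/unset state, as existing code sees it
    boolean isJsonUnset(int row);  // called only by JSON-aware readers and writers
    void setValue(int row);
    void setNull(int row);
    void setUnset(int row);
}

// The two-vector hack: one byte per row in each of "bits" and "bits2".
final class TwoVectorStore implements NullabilityStore {
    private final byte[] bits;
    private final byte[] bits2;

    TwoVectorStore(int rows) {
        bits = new byte[rows];
        bits2 = new byte[rows];
    }

    public boolean isDrillNull(int row) { return bits[row] == 0; }
    public boolean isJsonUnset(int row) { return bits[row] == 0 && bits2[row] == 1; }
    public void setValue(int row) { bits[row] = 1; bits2[row] = 0; }
    public void setNull(int row)  { bits[row] = 0; bits2[row] = 0; }
    public void setUnset(int row) { bits[row] = 0; bits2[row] = 1; }
}

// Assumed compact layout: a single state code per row (0 = null, 1 = value, 2 = unset).
final class SingleVectorStore implements NullabilityStore {
    private final byte[] state;

    SingleVectorStore(int rows) {
        state = new byte[rows];
    }

    public boolean isDrillNull(int row) { return state[row] != 1; }
    public boolean isJsonUnset(int row) { return state[row] == 2; }
    public void setValue(int row) { state[row] = 1; }
    public void setNull(int row)  { state[row] = 0; }
    public void setUnset(int row) { state[row] = 2; }
}

final class NullabilityStoreDemo {
    public static void main(String[] args) {
        NullabilityStore[] stores = { new TwoVectorStore(3), new SingleVectorStore(3) };
        for (NullabilityStore s : stores) {
            s.setValue(0);
            s.setNull(1);
            s.setUnset(2);
            for (int row = 0; row < 3; row++) {
                System.out.println(s.getClass().getSimpleName() + " row " + row
                    + ": drillNull=" + s.isDrillNull(row)
                    + ", jsonUnset=" + s.isJsonUnset(row));
            }
        }
    }
}
```

Once callers depend only on the interface, swapping the layout is a local change. It does
nothing for clients that read the raw vectors directly, which is the JDBC driver problem noted
above.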

> Null maps / lists and non-provided state support for JSON fields. Numeric types promotion.
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested structure, the
> result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
