drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields
Date Fri, 19 May 2017 18:23:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017792#comment-16017792 ]

Paul Rogers commented on DRILL-4824:

Thanks for the explanation! Let’s take a step back and extract the requirements/goals from
the implementation outline:

* Allow maps to be nullable.
* Allow evolving the type of a column based on data observed.

Let’s talk a bit more about each one. For the map vector, I agree that we’d need to add
a “bit vector” to track the nullability of the entire map. This will be tricky as it must
be coordinated with each entry in the map: if the map is null, then every vector in the map
must also be null (so that we maintain proper row indexing and keep the offset vectors up-to-date.)
So, from the perspective of existing code, a null map and a map of nulls are equivalent. For
output, however, a null map would be different, at least for JSON.
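To make the coordination concrete, here is a toy sketch of a nullable map vector with its own "bits" vector, where nulling the map forces every child null so row indexes stay aligned. The class and method names are hypothetical, not Drill's actual vector code:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a nullable map vector: one "bits" vector for the map itself,
// plus per-child bits that must be cleared whenever the map itself is null.
class NullableMapVector {
    private final BitSet mapIsSet = new BitSet();                 // 1 = map present at row
    private final Map<String, BitSet> childIsSet = new LinkedHashMap<>();

    void addChild(String name) { childIsSet.put(name, new BitSet()); }

    // Writing a null map must null every child so row indexing stays consistent.
    void setNull(int row) {
        mapIsSet.clear(row);
        for (BitSet child : childIsSet.values()) child.clear(row);
    }

    void setChild(int row, String name) {
        mapIsSet.set(row);                                        // a non-null child implies a non-null map
        childIsSet.get(name).set(row);
    }

    boolean isNull(int row) { return !mapIsSet.get(row); }
    boolean childIsNull(int row, String name) { return !childIsSet.get(name).get(row); }
}
```

Note that, as the text says, readers that ignore the map's own bit see a null map and a map of nulls identically; only an output format such as JSON would consult the map bit to distinguish them.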

We’d have to do the same for arrays. A field foo might be an array in JSON, or null. So,
we’d either need an “isNull” vector for repeated type members, or do the bitmap trick
to add this info to the array offset vector.
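One hypothetical form of the "bitmap trick" for arrays: steal the high bit of each offset entry as the is-null flag, rather than paying for a separate bit vector. This encoding is purely illustrative (it is not Drill's format, and it halves the addressable offset range):

```java
// Illustrative only: encode per-row array nullability in the high bit of the
// offset vector, so a repeated type needs no separate is-null vector.
class NullableOffsets {
    private static final int NULL_BIT = 0x80000000;    // high bit of each offset entry
    private int[] offsets = new int[] {0};             // offsets[i+1] ends row i

    void addRow(int length, boolean isNull) {
        int prevEnd = offsets[offsets.length - 1] & ~NULL_BIT;
        int end = prevEnd + (isNull ? 0 : length);     // a null row consumes no data
        offsets = java.util.Arrays.copyOf(offsets, offsets.length + 1);
        offsets[offsets.length - 1] = isNull ? (end | NULL_BIT) : end;
    }

    boolean isNull(int row) { return (offsets[row + 1] & NULL_BIT) != 0; }
    int start(int row)  { return offsets[row] & ~NULL_BIT; }
    int length(int row) { return (offsets[row + 1] & ~NULL_BIT) - start(row); }
}
```

With this scheme a null array and an empty array both have length 0, but only the null one sets the flag bit, which mirrors the null-map vs. map-of-nulls distinction above.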

Let's think about changing the data type. We can only play the “revise the type based on
new info” game on the first batch. Once the JSON reader sends a batch downstream, changing
the type will be a schema change, which would be fine except that few Drill operators, and
no JDBC/ODBC clients, handle schema changes.

Still, the idea is good for data that varies frequently within the first batch (the first
10K-60K records). For example, seeing “10, 20, 30, 40.5” would mean that the value could
start as an integer, then evolve to a double.

Perhaps we can do this in the new mutator created for DRILL-5211. Just start writing data
as one type, then silently replace the original vector with a new one of the new type. We’d
define a “promotion” matrix: vector X can be promoted to vector Y (int to long to double
to decimal, say).
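Such a matrix could be as simple as the sketch below. The type names are illustrative (Drill's actual types live in its MinorType enum), and the promotion chain shown is the int-to-long-to-double-to-decimal one from the text:

```java
import java.util.EnumMap;
import java.util.EnumSet;

// Hypothetical promotion matrix: may vector type "from" be silently replaced
// by "to" when wider data shows up? Type names are illustrative only.
enum ScalarType { INT, BIGINT, FLOAT8, DECIMAL }

class PromotionMatrix {
    private static final EnumMap<ScalarType, EnumSet<ScalarType>> CAN_PROMOTE =
        new EnumMap<>(ScalarType.class);
    static {
        CAN_PROMOTE.put(ScalarType.INT,     EnumSet.of(ScalarType.BIGINT, ScalarType.FLOAT8, ScalarType.DECIMAL));
        CAN_PROMOTE.put(ScalarType.BIGINT,  EnumSet.of(ScalarType.FLOAT8, ScalarType.DECIMAL));
        CAN_PROMOTE.put(ScalarType.FLOAT8,  EnumSet.of(ScalarType.DECIMAL));
        CAN_PROMOTE.put(ScalarType.DECIMAL, EnumSet.noneOf(ScalarType.class));
    }

    static boolean canPromote(ScalarType from, ScalarType to) {
        return from == to || CAN_PROMOTE.get(from).contains(to);
    }
}
```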

If we do this in the mutator, then every reader will have the ability to do the same trick.
That is, perhaps only JSON needs the set/not-set flag and nullable arrays and maps. But, all
readers can benefit from the ability to evolve type selection based on observed data.

Changing the data type may require copying: copying the first 5K ints, say, when we discover
that the type is really double. I'd suggest that the cost of copying is acceptable. We copy
the data anyway as we grow vectors. In general, a 16 MB vector (the new max size) will get
that way by doubling from, say, 256K: 256K, 512K, 1M, 2M, 4M, 8M, 16M. (This is something
I hope to improve, but that is another topic.)
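The doubling arithmetic above is easy to verify: each doubling copies the old buffer once, so growing 256K to 16M copies 256K + 512K + ... + 8M = 16M - 256K bytes in total, i.e. less than one full final-size copy. A one-time promotion copy is in the same ballpark as one more growth step:

```java
// Back-of-envelope check: total bytes copied while doubling a buffer from
// startBytes up to finalBytes. Each doubling copies the old buffer once.
class GrowthCost {
    static long bytesCopied(long startBytes, long finalBytes) {
        long copied = 0;
        for (long size = startBytes; size < finalBytes; size *= 2) {
            copied += size;          // the old buffer is copied into the doubled one
        }
        return copied;
    }
}
```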

The new mutator (the vector writers) works by having a single column writer type with methods
like setInt, setLong, setDouble, etc. Internally, each “column writer” turns around and calls
a generated, type-specific writer. So, setInt() calls setInt() on the version generated for
IntVector. For an int writer, all other methods (setLong, setDouble) throw an exception.
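The delegation pattern might look like the following sketch. These are stand-in classes, not the generated writers from the DRILL-5211 work; the vector is modeled as a plain list:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the single column-writer API whose type-specific implementations
// accept only their own type and throw for the rest. Illustrative names only.
interface ColumnWriter {
    default void setInt(int v)       { throw new UnsupportedOperationException("setInt"); }
    default void setDouble(double v) { throw new UnsupportedOperationException("setDouble"); }
}

class IntColumnWriter implements ColumnWriter {
    final List<Integer> vector = new ArrayList<>();   // stand-in for an IntVector
    @Override public void setInt(int v) { vector.add(v); }  // only setInt is legal here
}
```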

To allow type promotion, we’d create a second implementation that would:

* Promote the vector and change writers as needed (calling setDouble on an int vector, say).
* Convert compatible types. (Calling setInt on a double vector, say.)

The result is that the work is completely transparent to the record reader. The record reader
just calls setFoo() for some type Foo, and the mutator does the rest.
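A minimal sketch of that second implementation, using the “10, 20, 30, 40.5” example from above: it converts compatible values to the current vector type, and on a wider value it promotes the vector by copying the data already written. The class is hypothetical and models the vector as a list:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical promoting writer: setInt on a double vector converts the value;
// setDouble on an int vector promotes the vector (copying existing values),
// all invisibly to the record reader calling it.
class PromotingWriter {
    private List<Number> vector = new ArrayList<>();
    private boolean isDouble = false;            // current vector type: int until promoted

    void setInt(int v) {
        vector.add(isDouble ? (Number) (double) v : (Number) v);  // convert if promoted
    }

    void setDouble(double v) {
        if (!isDouble) promoteToDouble();        // promote first, then write
        vector.add(v);
    }

    private void promoteToDouble() {             // copy the ints written so far as doubles
        List<Number> promoted = new ArrayList<>();
        for (Number n : vector) promoted.add(n.doubleValue());
        vector = promoted;
        isDouble = true;
    }

    List<Number> values() { return vector; }
}
```

The promotion copy happens at most once per type transition, which is the copying cost discussed earlier.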

If we go that route, we can divide up the work into a number of JIRAs and work out who does
which parts.

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Volodymyr Vysotskyi
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +---------------------------+
> {code}
> There is no need to output missing fields. For a deeply nested structure, the result
> becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}

This message was sent by Atlassian JIRA
