drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
Date Wed, 27 Dec 2017 07:05:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
] 

Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM:
--------------------------------------------------------------

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON
object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, there are many
odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer dimensions but not
inner ones. That is, the following is legal:

{code}
{a: [["a", "string"], null, ["another"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the outer list is
a list of string arrays. But, in the second case, Drill cannot tell the array type when it
sees the {{null}}: it might be an array of scalars, so {{null}} is not allowed.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}}
type as described below that handles nulls. But, this vector is not used in Drill 1.12 or
earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type
is described in a separate note below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That is, an empty
array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees null values.
For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3,
Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill
identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as
a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section
below.) It seems that {{SqlLine}} will correctly show the null values.

An interesting case occurs when Drill reads two files: one with an array with only nulls,
another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur
in Drill 1.12. In 1.12, the equivalent condition is if File A contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. At some point,
the two arrays will come together. One fragment will have created a list of {{LATE}}, another
a list of {{BIGINT}}. Most operators will trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible
and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that
the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result
if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON
object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}}
type as described below that handles nulls. But, this vector is not used in Drill 1.12 or
earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type
is described in a separate note below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That is, an empty
array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees null values.
For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3,
Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill
identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as
a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section
below.) It seems that {{SqlLine}} will correctly show the null values.

An interesting case occurs when Drill reads two files: one with an array with only nulls,
another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur
in Drill 1.12. In 1.12, the equivalent condition is if File A contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. At some point,
the two arrays will come together. One fragment will have created a list of {{LATE}}, another
a list of {{BIGINT}}. Most operators will trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible
and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that
the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result
if the guess is wrong.

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests that Drill
may have limitations in the JSON that Drill supports. This ticket asks to clarify Drill's
expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed specifications that
clarifies what Drill does and does not support (or what is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message