drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6035) Specify Drill's JSON behavior
Date Fri, 15 Dec 2017 21:46:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293298#comment-16293298 ]

Paul Rogers commented on DRILL-6035:

Drill support for JSON objects consists of four parts.

* Drill supports a non-standard serialized JSON format in which an input file is a sequence
of JSON objects.
* Values within the top-level JSON object give rise to columns within the Drill row.
* Nested objects in JSON give rise to nested {{MAP}} columns in Drill.
* Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill.

h4. Top-Level JSON Objects

Example of the expected JSON input format:

{a: 10}
{a: 20}
{a: 30}

Drill allows any amount of white-space between objects. It is common to place each object
on a new line, though this is not required.

Note that Drill allows member names to be quoted or unquoted, although standard JSON requires quotes. The following are both valid in Drill:

{"a": 10}
{a: 20}
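
As a rough illustration of the input format described above (not Drill's actual reader), the following Python sketch reads a sequence of top-level JSON objects separated by any amount of white-space. The function name {{read_json_sequence}} is hypothetical, and because it relies on Python's standard {{json}} module, it accepts only the quoted form of member names.

```python
import json

def read_json_sequence(text):
    """Yield each top-level JSON object from a white-space-separated sequence."""
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while True:
        # Skip any amount of white-space between objects.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            return
        # raw_decode parses one value and reports where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Objects may sit on separate lines or share a line.
rows = list(read_json_sequence('{"a": 10}\n{"a": 20}   {"a": 30}'))
```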

h4. Drill MAP type

Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type in Drill is
closer to the {{STRUCT}} type in Impala and Hive. That is, like a {{STRUCT}}, the schema of
all map instances is identical across rows. (This is unlike, say, a JSON object or Python
dict, in which the members of one instance are independent of those in any other instance.)

As a result, the following example:

{a: {x: 10}}
{a: {y: 20}}

Gives rise to records in Drill with data similar to the following:

{a: {x: 10, y: null}}
{a: {x: null, y: 20}}
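
The effect of a shared map schema can be sketched in Python as follows. This only models the outcome shown above, under the assumption that the schema is the union of all member names seen; it says nothing about how Drill builds the schema internally. The helper {{merge_map_schema}} is a hypothetical name.

```python
def merge_map_schema(rows, column):
    """Give every map in `column` the union of member names, null-filling gaps."""
    members = []
    for row in rows:
        for name in (row.get(column) or {}):
            if name not in members:
                members.append(name)  # keep first-seen order
    return [
        {**row, column: {name: (row.get(column) or {}).get(name) for name in members}}
        for row in rows
    ]

merged = merge_map_schema([{"a": {"x": 10}}, {"a": {"y": 20}}], "a")
```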

h4. JSON Object Arrays and Drill Repeated MAPs

The example below shows a repeated object which gives rise to a {{REPEATED MAP}}:

{a: [{b: 10}, {b: 20}]}
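
To make the distinction between the two column kinds concrete, here is a small Python sketch that labels a parsed JSON value with the terms used in this comment. The function {{drill_column_kind}} and the {{"OTHER"}} label are hypothetical; scalars, nulls and arrays of non-objects are deliberately left out of scope.

```python
def drill_column_kind(value):
    """Classify one parsed JSON value using this comment's terms (labels only)."""
    if isinstance(value, dict):
        return "MAP"
    if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        return "REPEATED MAP"
    return "OTHER"  # scalars, nulls and non-object arrays are out of scope here

row = {"a": [{"b": 10}, {"b": 20}], "m": {"x": 1}, "n": 5}
kinds = {name: drill_column_kind(value) for name, value in row.items()}
```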

h4. Nulls with JSON Objects

JSON allows an object-valued member to be null or to be omitted entirely:

{id: 1, a: {x: 10}}
{id: 2, a: null}
{id: 3}

Drill does not support the concept of a "nullable map". Instead, Drill defines all map members
to be nullable. If the entire object is null (or missing) in JSON, Drill treats this the same
as if every member were null. Thus, in Drill, the following are all equivalent:

{id: 1, a: {x: null, y: null}}
{id: 2, a: {}}
{id: 3, a: {x: null}}
{id: 4, a: null}
{id: 5}
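
The equivalence can be illustrated with a short Python sketch, assuming the map schema for {{a}} is known to contain the members {{x}} and {{y}}. The helper {{normalize_map}} is hypothetical and only models the rule stated above: a null or missing object is treated as a map whose members are all null.

```python
def normalize_map(row, column, members):
    """Return the map in `column` with every schema member present, null if absent."""
    value = row.get(column)  # a missing key and an explicit null both give None
    if value is None:
        value = {}
    return {name: value.get(name) for name in members}

rows = [
    {"id": 1, "a": {"x": None, "y": None}},
    {"id": 2, "a": {}},
    {"id": 3, "a": {"x": None}},
    {"id": 4, "a": None},
    {"id": 5},
]
normalized = [normalize_map(r, "a", ["x", "y"]) for r in rows]
```

Under these assumptions every entry of {{normalized}} is the same all-null map, which is why the original form cannot be recovered afterwards.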

As a side note, when exporting the above data to a JSON file, Drill cannot recreate the original
structure. Instead, it writes all of the above in a common format. (The format has evolved
in response to previous bugs; the current choice still needs to be investigated.)

As described for scalars, Drill will defer selecting a type for a column if the initial records
consist only of null values. If a later value is revealed to be a map, Drill will choose the
map type. If the file (or first batch) consists only of nulls, then Drill cannot know the
type and guesses {{VARCHAR}}. This will lead to a schema change error if a later file (or
batch) reveals the type to actually be a map (since {{VARCHAR}} and {{MAP}} are not compatible).

{id: 1}
{id: 2, a: null}
{id: 3, a: null}
{id: 4, a: {x: 10, y: 20}}
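
The deferred type choice can be modeled with the Python sketch below. It is a simplification under stated assumptions: each batch picks its type independently, any non-map value is lumped under a hypothetical {{"SCALAR"}} label, and {{SchemaChangeError}}, {{infer_batch_type}} and {{check_batches}} are illustrative names, not Drill APIs.

```python
class SchemaChangeError(Exception):
    """Raised when two batches disagree on a column's type (illustrative only)."""

def infer_batch_type(values):
    """Pick a column type for one batch; an all-null batch falls back to VARCHAR."""
    for v in values:
        if v is None:
            continue  # defer: a null says nothing about the type
        return "MAP" if isinstance(v, dict) else "SCALAR"
    return "VARCHAR"  # only nulls were seen, so guess

def check_batches(batches):
    """Infer a type per batch and fail if consecutive batches disagree."""
    types = [infer_batch_type(b) for b in batches]
    for prev, cur in zip(types, types[1:]):
        if prev != cur:
            raise SchemaChangeError(f"{prev} is not compatible with {cur}")
    return types

# Nulls followed by a map in the same batch: the map type is chosen.
same_batch = infer_batch_type([None, None, None, {"x": 10, "y": 20}])
```

If the nulls and the map instead arrive in separate batches, for example {{check_batches([[None, None, None], [{"x": 10, "y": 20}]])}}, the first batch is guessed as {{VARCHAR}} and the sketch raises {{SchemaChangeError}}.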

> Specify Drill's JSON behavior
> -----------------------------
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
> Drill supports JSON as its native data format. However, experience suggests that Drill
may have limitations in the JSON that Drill supports. This ticket asks to clarify Drill's
expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed specification that
clarifies what Drill does and does not support (or what it should and should not support).

