drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6035) Specify Drill's JSON behavior
Date Sat, 16 Dec 2017 01:19:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293521#comment-16293521 ]

Paul Rogers commented on DRILL-6035:
------------------------------------

h4. Object Key Names

The [JSON Standard|https://www.json.org] identifies an object as:

{code}
{ (string: value)* }
{code}

That is, the key portion of the name/value pair can be an arbitrary string, encoded in UTF-8.

Drill follows the SQL rules for names in SQL statements:

* Names are case insensitive
* Names must follow certain syntax rules (but those rules can be skipped if the name is enclosed in back-ticks).
* Names must consist of at least a single character

Drill rules for names in JSON are:

* Names need not be quoted if they are unambiguous.
* Names are considered case insensitive for comparison purposes.

When determining column names:

* If the query includes a {{SELECT *}}, Drill uses the names (and case) specified in JSON.
* If the query includes an explicit projection, {{SELECT x, y, z}}, then Drill uses the names and case specified in the SQL. That is, even if the JSON field names are "X", "Y" and "Z", Drill will still name the columns `x`, `y` and `z`.
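
The column-naming rules above can be sketched in a few lines of Python. This is only an illustration, not Drill's implementation: the function name {{output_columns}} is hypothetical, and using {{str.lower()}} for case-insensitive comparison is an assumption.

```python
def output_columns(json_keys, projection=None):
    """Return (column name, matched JSON key) pairs for a query.

    projection=None models SELECT *; otherwise it is the list of names
    written in the SELECT clause. Hypothetical helper, for illustration.
    """
    if projection is None:
        # SELECT *: keep the names and the case found in the JSON.
        return [(key, key) for key in json_keys]
    # Explicit projection: match case-insensitively (assumed: str.lower()),
    # but name the column exactly as written in the SQL.
    by_folded = {key.lower(): key for key in json_keys}
    return [(name, by_folded.get(name.lower())) for name in projection]
```

A projected name with no matching JSON key is paired with {{None}} here; what Drill does for missing fields is outside the scope of this sketch.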

h4. Case Sensitivity Conflicts

The above sets up conflicts between the JSON and Drill naming rules:

* The names "a" and "A" are distinct in JSON, identical in Drill.
* The string "" is a valid key in JSON, but an invalid name in Drill.

Although Drill allows the use of back-ticks to escape non-standard names, this syntax cannot
be used to overcome Drill's case insensitivity. That is, all of the following match either "x"
or "X":

* {{x}}
* {{X}}
* {{`x`}}
* {{`X`}}
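
As a rough Python sketch of that matching rule (the helper {{names_match}} is hypothetical, and {{str.lower()}} stands in for whatever case folding Drill actually applies):

```python
def names_match(sql_name, json_key):
    """Compare a SQL name with a JSON key the way the list above describes."""
    # Back-ticks only quote the name; they do not make it case sensitive.
    if len(sql_name) >= 2 and sql_name.startswith("`") and sql_name.endswith("`"):
        sql_name = sql_name[1:-1]
    # Assumed folding: plain lower-casing.
    return sql_name.lower() == json_key.lower()
```

With this helper, each of the four spellings listed above matches both the key "x" and the key "X".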

If Drill is presented with a JSON document with names that differ only in case, then the last
name wins. That is, given this input:

{code}
{x: 10, X: 20}
{code}

Drill (in Version 1.13) will not notice the "duplicate" name, but will rather simply overwrite
the first "x" with the second "X", producing a single column "x" with the value 20 for the
first record. In this, Drill follows RFC 7159: "When the names within an object are not unique,
the behavior of software that receives such an object is unpredictable. Many implementations
report the last name/value pair only."
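
The "last name wins" behavior can be imitated with Python's standard {{json}} module. This is a sketch under assumptions, not Drill's reader: {{merge_case_insensitive}} and {{load_record}} are hypothetical names, the keys are quoted because Python's parser requires strict JSON, and case folding is approximated with {{str.lower()}}.

```python
import json

def merge_case_insensitive(pairs):
    """Collapse keys that differ only in case; the last value wins."""
    columns = {}  # folded name -> (column name, value)
    for key, value in pairs:
        folded = key.lower()
        # The first spelling seen names the column; a later value overwrites.
        name = columns[folded][0] if folded in columns else key
        columns[folded] = (name, value)
    return dict(columns.values())

def load_record(text):
    # object_pairs_hook receives every name/value pair, duplicates included.
    return json.loads(text, object_pairs_hook=merge_case_insensitive)
```

Parsing the record above (with quoted keys) through {{load_record}} yields a single column "x" holding 20, which mirrors the result described for Drill.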

h4. Empty Names

JSON allows an empty key name:

{code}
{"": 10}
{code}

Drill (in 1.13) will raise an error in this situation.

h4. Leading and Trailing Spaces

JSON keys are arbitrary strings, which means JSON allows keys with leading and trailing spaces:

{code}
{" a": 10, " b": 20, " c ": 30}
{code}

Drill (in 1.13) strips leading and trailing spaces. Thus, a name that consists only of spaces
is considered to be empty. The three names shown above are considered to be `a`, `b` and `c`.
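
Combined with the empty-name rule, the space handling might look like the following Python sketch. The function {{normalize_key}} and the choice of {{ValueError}} are assumptions for illustration; the actual error Drill raises is not specified here.

```python
def normalize_key(key):
    """Strip leading and trailing spaces; reject names that end up empty."""
    name = key.strip(" ")
    if not name:
        # Covers both "" and keys made up only of spaces.
        raise ValueError("empty column name: %r" % key)
    return name
```

Applied to the three keys in the example, this gives `a`, `b` and `c`, while an empty key or an all-space key raises the error.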

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests that Drill may have limitations in the JSON that Drill supports. This ticket asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed specification that clarifies what Drill does and does not support (or what it should and should not support).



