Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Thu, 28 Dec 2017 19:43:00 +0000 (UTC)
From: "Paul Rogers (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.13125330.1513372239000.543063.1514490180216@Atlassian.JIRA>
In-Reply-To: <JIRA.13125330.1513372239000@Atlassian.JIRA>
References: <JIRA.13125330.1513372239000@Atlassian.JIRA> <JIRA.13125330.1513372239639@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 28 Dec 2017 19:43:06 -0000


    [ https://issues.apache.org/jira/browse/DRILL-6035?page=3Dcom.atlassian=
.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1630=
5689#comment-16305689 ]=20

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:42 PM:
--------------------------------------------------------------

h4. JSON Translation in Drill

The goal of the JSON reader is to load JSON data into Drill vectors. The ch=
allenge, as shown above, is that the two data models are different. The pro=
blems are two-fold:

* There exist many perfectly valid ways to translate relational data into J=
SON.
* There exist no universal ways to translate arbitrary JSON into a relation=
al model.

That is, JSON can represent any relational table, and do so in a variety of=
 ways. There is a one (table) to many (JSON formats) relationship. Further,=
 there are many JSON structures that do not correspond to tables.

h4. Lack of JSON Support in JDBC and ODBC

The issue is further complicated by the fact that ODBC (Tableau) and JDBC a=
re Drill=E2=80=99s primary client interfaces. These formats do not readily =
support non-tabular data. Thus, not only just Drill successfully consume JS=
ON DAGs, Drill must then convert these structures into simple tables to be =
consumed by BI tools. (That is, Drill is not a document-oriented query engi=
ne, it is a classic tabular query engine.)

h4. Many-to-Many Mappings from JSON to Tables

The challenge of the JSON reader is to convert arbitrary JSON into a relati=
onal model, and to do so with no (or very little) information beyond a list=
 of projected columns and the JSON file itself.

As we will see, the problem is fundamentally not solvable: Drill has too li=
ttle information to correctly map from the many possible (and often conflic=
ting) JSON formats into the proper relational format.

Consider a trivial example.  The following is a perfectly legal representat=
ion of a Customer in JSON:

{code}
[101, =E2=80=9CFred=E2=80=9D, =E2=80=9CBedrock=E2=80=9D, 123.45, =E2=80=9C1=
0-11-12=E2=80=9D]
{code}

Applications sometimes use the above format to conserve space. It works bec=
ause the writer and reader agree on the meaning of each array entry.

How is Drill to interpret the above. More to the point, how is Drill to int=
erpret the above *without a schema*? Said another way, the above format wor=
ks because the writer and reader agree on a format, but Drill is designed t=
o work without that information. Clearly, without a schema, it is impossibl=
e to understand that the above is a terse representation of a row.

Without a schema, all Drill knows is that the above is a heterogeneous arra=
y. How is Drill to know that this array has a one-to-one correspondence to =
columns in a customer record vs. say, an arbitrary array? It can=E2=80=99t.=
 All Drill can do is ask the user to enable =E2=80=9Call-text mode=E2=80=9D=
, read the array as text, then use SQL to project the array entries correct=
ly:

{code}
SELECT CAST(cust[0] AS INTEGER) AS cust_id,
       cust[1] AS cust_name, cust[2] AS city,
       CAST(cust[3] AS FLOAT8) AS balance,
       TO_DATE(cust[4], =E2=80=98yy-mm-dd=E2=80=99) as start_date FROM =E2=
=80=A6
{code}

Although this example is contrived, =E2=80=9CJSON in the wild=E2=80=9D has =
a wide variety of formats since JSON is a universal format and places no co=
nstraints on an application=E2=80=99s data model.

h4. Schema Inference

Translation of JSON to the relational model is simple when the JSON is desi=
gned to exactly fit what Drill expects.=20

Since JSON is Drill=E2=80=99s reference model, Drill attempts to translate =
all valid JSON into the relational model. This is impossible in the general=
 case. Here are just a few of the complexities:

* A run of null values without a non-null value.
* Different data types for an attribute name across objects. (That is, JSON=
 does not enforce a schema.)
* Different data types (as parsed by Drill) for array elements. (Such as th=
e customer row example above.)
* Nulls inside arrays.
* =E2=80=9CSophisticated=E2=80=9D data models such as those described earli=
er

Drill implements a variety of special rules to handle some of the above in =
some special cases. These rules were presented in earlier sections. To summ=
arize a few:

* All-text mode can overcome different primitive types in an object attribu=
te or list (at the cost of extensive casts in the SQL expression.) But, all=
-text mode cannot overcome a change from primitive type to object or list. =
All-text mode applies to all queries within a session, not just to the one =
column with a conflict.
* Using a projection list to avoid projecting a column with conflicting typ=
es (but, then the value is unavailable to Drill queries.) A projection list=
, however, changes the data structure, moving columns nested inside maps to=
 the top level, requiring changes elsewhere in the query to adjust.
* =E2=80=9CNull-deferral=E2=80=9D delays picking a type for a null column u=
ntil a value is seen, but can=E2=80=99t see across a batch boundary.
* =E2=80=9CAll numbers as Float=E2=80=9D handles the case of integer number=
s followed by numbers with a decimal point, but is a session option so appl=
ies to all queries within a session, not just the one column with the confl=
ict.

h4. Distribution Considerations

The issues are further compounded because Drill is a distributed query engi=
ne. Rules and inferences applied by one JSON reader area unknown to other r=
eaders (in other fragments) working on different JSON files for the same qu=
ery. In the simplest case, file A has column `x` which is clearly a FLOAT8.=
 File B, created earlier, has no column `x`, and so the reader guesses null=
able INT (Drill 1.12 or before) or nullable VARCHAR (Drill 1.13.) The resul=
t is conflict elsewhere in the query DAG.

h4. Open Schema Inference Issues

The general conclusion is that Drill suffers from three intrinsic limitatio=
ns:

* Drill cannot predict the future (can=E2=80=99t see a billion rows ahead o=
r predict the contents of files not yet read).
* Drill cannot reverse engineer intent from JSON structure.
* Drill cannot share data across readers (reader of file A cannot coordinat=
e with the reader of file B to agree on a schema.)

The above are not bugs; they are intrinsic characteristic of a distributed =
schema-free query system.

Drill must infer schema on the first record (or, for nulls, in the first ba=
tch.) Information that arrives later (or in another file) is of no help in =
inferring schema. (That is, Drill cannot see into the future.)

There are many ways to encode the same data into JSON. Each JSON encoding c=
ould represent many data formats. Drill has a preference, but that preferen=
ce cannot magically calls all JSON file creators to adopt Drill=E2=80=99s p=
referred format.


was (Author: paul.rogers):
h4. JSON Translation in Drill

The goal of the JSON reader is to load JSON data into Drill vectors. The ch=
allenge, as shown above, is that the two data models are different. The pro=
blems are two-fold:

* There exist many perfectly valid ways to translate relational data into J=
SON.
* There exist no universal ways to translate arbitrary JSON into a relation=
al model.

That is, JSON can represent any relational table, and do so in a variety of=
 ways. There is a one (table) to many (JSON formats) relationship. Further,=
 there are many JSON structures that do not correspond to tables.

h4. Lack of JSON Support in JDBC and ODBC

The issue is further complicated by the fact that ODBC (Tableau) and JDBC a=
re Drill=E2=80=99s primary client interfaces. These formats do not readily =
support non-tabular data. Thus, not only just Drill successfully consume JS=
ON DAGs, Drill must then convert these structures into simple tables to be =
consumed by BI tools. (That is, Drill is not a document-oriented query engi=
ne, it is a classic tabular query engine.)

h4. Many-to-Many Mappings from JSON to Tables

The challenge of the JSON reader is to convert arbitrary JSON into a relati=
onal model, and to do so with no (or very little) information beyond a list=
 of projected columns and the JSON file itself.

As we will see, the problem is fundamentally not solvable: Drill has too li=
ttle information to correctly map from the many possible (and often conflic=
ting) JSON formats into the proper relational format.

Consider a trivial example.  The following is a perfectly legal representat=
ion of a Customer in JSON:

{code}
[101, =E2=80=9CFred=E2=80=9D, =E2=80=9CBedrock=E2=80=9D, 123.45, =E2=80=9C1=
0-11-12=E2=80=9D]
{code}

Applications sometimes use the above format to conserve space. It works bec=
ause the writer and reader agree on the meaning of each array entry.

How is Drill to interpret the above. More to the point, how is Drill to int=
erpret the above *without a schema*? Said another way, the above format wor=
ks because the writer and reader agree on a format, but Drill is designed t=
o work without that information. Clearly, without a schema, it is impossibl=
e to understand that the above is a terse representation of a row.

Without a schema, all Drill knows is that the above is a heterogeneous arra=
y. How is Drill to know that this array has a one-to-one correspondence to =
columns in a customer record vs. say, an arbitrary array? It can=E2=80=99t.=
 All Drill can do is ask the user to enable =E2=80=9Call-text mode=E2=80=9D=
, read the array as text, then use SQL to project the array entries correct=
ly:

{code}
SELECT CAST(cust[0] AS INTEGER) AS cust_id,
       cust[1] AS cust_name, cust[2] AS city,
       CAST(cust[3] AS FLOAT8) AS balance,
       TO_DATE(cust[4], =E2=80=98yy-mm-dd=E2=80=99) as start_date FROM =E2=
=80=A6
{code}

Although this example is contrived, =E2=80=9CJSON in the wild=E2=80=9D has =
a wide variety of formats since JSON is a universal format and places no co=
nstraints on an application=E2=80=99s data model.

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience sugges=
ts that Drill may have limitations in the JSON that Drill supports. This ti=
cket asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed specifi=
cations that clarifies what Drill does and does not support (or what is sho=
uld and should not support.)


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)