drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-6062) Simplify, Document JSON input format
Date Thu, 28 Dec 2017 21:11:03 GMT
Paul Rogers created DRILL-6062:
----------------------------------

             Summary: Simplify, Document JSON input format
                 Key: DRILL-6062
                 URL: https://issues.apache.org/jira/browse/DRILL-6062
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers


DRILL-6035 documents the limitations of the JSON readers in Drill 1.12 and 1.13. Many of these
limitations are due to the difficulty of mapping arbitrary JSON documents into a relational
model. Drill has many ad-hoc, partial workarounds, but together they do not add up to a complete,
production-quality solution.

Solutions for full JSON schema mapping are likely beyond what Drill can (or should) achieve.
This ticket suggests we take a different, more realistic approach and simply acknowledge that
Parquet is the best format for Drill, while providing minimal (but solid) JSON support.

h4. Redefine Drill's Target Data Model

Change the Drill web site to explain that Parquet is Drill's target data model. Drill supports
other formats to the degree that they mimic (a subset of) Parquet.

More specifically:

* Drill is a relational, columnar engine.
* Each Drill column must have a single, known data type.
* Drill arrays cannot contain null values.
* Drill supports maps (Parquet structs) and repeated maps.
* Drill assumes that the file schema is the same across all files in a data set.

As it turns out, this is exactly the Parquet model.

h4. Redefine Drill's JSON Support

Given the above, redefine the JSON that Drill supports as only that which follows the Parquet model.
Drill provides no external schema. Instead, the JSON must be structured to provide a single,
clear mapping from the JSON to Drill's internal Parquet format, with no ambiguities:

* Every file consists of a fixed set of objects.
* Lists of scalars (without nulls) or objects.
* Single, consistent type for each name/value pair.
* No null values. (For key/value pairs, omit the pair if the value is null.)
* No empty files.

Of particular concern are files with high "null density": many nulls without declaring a type.
Drill cannot effectively support such files.
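
The rules above could be checked mechanically before a file ever reaches Drill. The following is
a minimal Python sketch, not Drill code: the function names, the path notation ({{a.b}}, {{a[]}})
and the coarse type labels are all assumptions made for illustration. It also assumes that integer
and float count as different types, and that lists of lists are rejected.

```python
import json


def type_name(value):
    """Return a coarse type label for a parsed JSON value (labels are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def check_value(path, value, seen, problems):
    """Check one value, recording the first type seen for each path."""
    kind = type_name(value)
    if kind == "null":
        problems.append(f"{path}: null value")
        return
    previous = seen.setdefault(path, kind)
    if previous != kind:
        problems.append(f"{path}: type {kind} conflicts with {previous}")
        return
    if kind == "object":
        for name, child in value.items():
            check_value(f"{path}.{name}", child, seen, problems)
    elif kind == "list":
        for element in value:
            if isinstance(element, list):
                problems.append(f"{path}: nested list")
            else:
                check_value(f"{path}[]", element, seen, problems)


def check_records(records):
    """Return a list of rule violations for a sequence of parsed JSON records."""
    problems = []
    if not records:
        problems.append("empty input")
    seen = {}
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not an object")
            continue
        for name, child in record.items():
            check_value(name, child, seen, problems)
    return problems


if __name__ == "__main__":
    records = [json.loads('{"a": 1, "b": null}'), json.loads('{"a": "one"}')]
    for problem in check_records(records):
        print(problem)
```

An empty result means the records satisfy the rules as interpreted here. Each problem is reported
against the path where it occurs.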

h4. External ETL for Non-Compliant JSON

Rather than either a) invest in JSON mapping, or b) allow queries to fail, Drill should encourage
the use of external ETL tools to convert non-compliant JSON into Parquet files. Since most
JSON is ad-hoc, created by and for specific applications, this means most JSON should pass
through an ETL layer into Parquet before being used with Drill.
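
As an illustration of one small step such an ETL layer might perform, the Python sketch below
applies the "omit the pair if the value is null" rule to a parsed JSON value. The function name
is hypothetical. Writing the cleaned records to Parquet is left to whatever ETL tool is in use
and is not shown. Nulls inside lists are rejected rather than dropped, on the assumption that
silently removing list elements would change the meaning of the data.

```python
def drop_null_pairs(value):
    """Return a copy of a parsed JSON value with null-valued name/value pairs removed."""
    if isinstance(value, dict):
        return {
            name: drop_null_pairs(child)
            for name, child in value.items()
            if child is not None
        }
    if isinstance(value, list):
        if any(element is None for element in value):
            # Assumption: there is no lossless fix, so surface the problem instead.
            raise ValueError("null inside a list cannot be dropped without changing the data")
        return [drop_null_pairs(element) for element in value]
    return value


if __name__ == "__main__":
    record = {"a": 1, "b": None, "c": {"d": None, "e": [1, 2]}}
    print(drop_null_pairs(record))
```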

h4. Simplify the JSON Reader

The JSON reader today attempts to use many partial, ad-hoc fixes to work around some JSON
ambiguity. These hacks are hard to test and maintain, requiring effort that would be better
invested elsewhere. Once we adopt Parquet as the reference format, and define the small, simpler
form of JSON, we can remove the hacks:

* Drop support for unions. (Unions are poorly supported and very complex.)
* Drop support for the {{ListVector}} (which is, essentially, a list of unions and does not
even work.)
* Drop support for multi-dimensional lists. (These do not have any well-defined mapping to
relational tables.)
* Drop support for leading nulls that span batches. (That is, the type of every value must
be revealed within the first batch.)
* Drop support for empty files. (Drill needs a schema internally. Drill invents a fake schema
today, but that just causes a schema change later. If desired, simply ignore such files rather
than failing the query.)
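
To make the "first batch" rule concrete, the Python sketch below finds top-level columns whose
first non-null value appears only after the first batch. It is an illustration, not the reader's
actual logic: the function name and the {{batch_size}} parameter are assumptions, and only
top-level names are examined.

```python
def late_columns(records, batch_size):
    """Return names whose first non-null value appears at or after index batch_size."""
    first_seen = {}
    for index, record in enumerate(records):
        for name, value in record.items():
            if value is not None and name not in first_seen:
                first_seen[name] = index
    return sorted(name for name, index in first_seen.items() if index >= batch_size)


if __name__ == "__main__":
    records = [{"a": 1}, {"a": 2, "b": None}, {"a": 3, "b": "x"}]
    # With a hypothetical batch size of 2, "b" reveals its type too late.
    print(late_columns(records, 2))
```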

h4. Implications for Drill 1.13 "Result Set Loader"

Much work was done to try to extend the result set loader to handle JSON ambiguities. The
List Vector, Repeated List Vector and Union Vectors were all implemented, leading to a vast
increase in complexity. If we adopt the above, this work can be backed out, resulting in a
smaller, more efficient, streamlined core. In short, remove the poorly-supported components
used only by JSON, keeping the types and mechanisms needed for Parquet (and Drill's internal
operators.)



