drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...
Date Mon, 20 Feb 2017 23:58:37 GMT
Github user paul-rogers commented on the issue:

    Three general rules to keep in mind in the current JSON reader implementation:
    * Drill can remember the past. (Once a type as been seen for a column, Drill will remember
that type.)
    * Drill cannot predict the future. (If a type has not been seen for a column by the end
of a record batch, Drill cannot predict what type will appear in some later batch.)
    * Drill can amend the past within a single record batch. (If a batch starts with nulls,
but later a type is seen, the previous values are automatically filled with nulls.)
    Actual implementation of the JSON reader, and the value writers that form the implementation,
is complex. As we read JSON values, we ask a type-specific writer to set that value into the
value vector. Each writer marks the column as non-null, then adds the value. Any values not
so set will default to null.
    Consider a file with five null "c1" values followed by a string value "foo" for that field.
The five nulls are ignored. When we see the non-null c1, the code creates a VarChar vector
and sets the 6th value to the string "foo". Doing so automatically marks the previous five
column values as null.
    Suppose we have a file with a single string value "foo" for column "c1", followed by five
nulls. In this case, the first value creates and sets the VarChar vector as before. Later,
at the end of reading the record batch, the reader sets the record count for the vectors.
This action, on the VarChar vector, has the effect of setting the trailing five column values
to null.
    Since values default to null, we get this behavior, and the previous, for free. The result
is that if a record batch contains even a single non-null value for a field, that column will
be fully populated with nulls for all other records in the same batch.
    This gets us back to the same old problem in Drill: if all we see are nulls, Drill needs
to know, "null of what type" while in JSON the value is just null. The JIRA tickets linked
to this ticket all related to that same underlying issue.
    There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479, DRILL-3806
and more.
    This fix affects only "all text mode." This means that, regardless of the JSON type, create
a VarChar column. Doing so provides a very simple fix. Since all columns are VarChar, when
we see a new column, with a null value, just create a VarChar column. (No need to set the
column to null.)
    That is, we can "predict the future" for nulls because *all* columns are VarChar -- so
there is not much to predict.
    Otherwise, we have to stick with Jacques' design decision in DRILL-1256: "Drill's perspective
is a non-existent column and a column with no value are equivalent." A record batch of all
nulls, followed by a record batch with a non-null value, will cause a schema change.
    Again, Drill needs a "null" type that is compatible with all other types in order to support
JSON semantics. (And, needs to differentiate between value-exists-and-is-null and value-does-not-exist.)
    Yet another solution is to have the user tell us their intent. The [JSON Schema](http://jsonschema.net)
project provides a way to express the expected schema so that Drill would know up front the
type of each column (and whether the column is really nullable.)

If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.

View raw message