drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: JSON reader enhancement
Date Mon, 20 Nov 2017 06:35:01 GMT
I can't speak to all use cases, but the variant map is very important when
dealing with JSON data whose schema has changed over time. To determine
which version of the schema my data uses, I need to know which fields are
present in a map. Last time I looked, all that happened was that all fields
were set to an empty list if they were missing. My thought is that the
fields should not be present at all.
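
To make the point concrete, here is a minimal sketch in plain Python (not
Drill) of the kind of version check that depends on knowing which fields are
really present in a map. The two records are modeled on the examples quoted
later in this thread; the function names and the rule that "first" marks the
old schema and "third" the new one are assumptions made up for illustration.

```python
import json

# Two hypothetical records whose map "a" changed shape over time
# (modeled on the examples quoted later in this thread).
records = [
    '{"a": {"first": 1, "second": 2}, "b": 3}',
    '{"a": {"second": 20, "third": 30}, "b": 5}',
]


def present_fields(record_text, map_name="a"):
    """Return the set of keys actually present in the named map."""
    record = json.loads(record_text)
    return set(record.get(map_name, {}).keys())


def guess_version(fields):
    """Hypothetical rule: 'first' marks the old schema, 'third' the new one."""
    if "first" in fields:
        return "v1"
    if "third" in fields:
        return "v2"
    return "unknown"


versions = [guess_version(present_fields(r)) for r in records]
print(versions)
```

If the reader filled in every missing field with a default value, both records
would report the same set of fields and a check like this could not tell the
versions apart.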

Regarding nested lists, that comes up in similar cases where some event has
a list of maps. The primary way that I have to deal with that in Drill is
to use kvgen. That's a clumsy mechanism and it leads me to the same problem
of missing fields, but it is doable. What we have now doesn't let me do
what I need to do.
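
As a rough illustration of the behavior I am after, the plain Python sketch
below turns each map in a list into key/value pairs, loosely mimicking the
shape of kvgen output followed by a flatten. It is not Drill code; the event
layout, field names, and helper functions are all hypothetical.

```python
import json

# Hypothetical event carrying a list of maps with differing fields.
event = json.loads(
    '{"event": "purchase", "items": ['
    '{"sku": "A1", "qty": 2}, '
    '{"sku": "B7", "discount": 0.1}]}'
)


def kv_pairs(m):
    """Produce one {key, value} pair per field that is present in the map."""
    return [{"key": k, "value": v} for k, v in m.items()]


def flatten_items(evt, list_name="items"):
    """Emit one row per (list element, present field)."""
    rows = []
    for index, item in enumerate(evt[list_name]):
        for pair in kv_pairs(item):
            rows.append((evt["event"], index, pair["key"], pair["value"]))
    return rows


rows = flatten_items(event)
for row in rows:
    print(row)
```

The second item produces no row for "qty" and the first produces none for
"discount", which is the property that matters: a field that was absent in the
input stays absent in the output.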

On Nov 20, 2017 00:01, "Paul Rogers" <progers@mapr.com> wrote:

> Hi Ted,
>
> Thanks for the suggestions.
>
> To handle nested lists correctly, we need Drill’s List data type, which
> uses Drill’s Union data type. (The List type is really mostly just a
> repeated Union, and so needs union support.) But, the union type is disabled
> by default. The case I was trying to handle is to avoid an exception when
> the union type is disabled, but a 2D array appears. You make a good case that
> we should leave the existing behavior, which I’ll do.
>
> What you seem to be saying is that the Union type and the List type should
> be completed and enabled by default. Then we need to add the missing
> functionality.
>
> I wonder, in an application, how would all of this be used? If Tableau is
> the primary client, then data is delivered via ODBC. But, ODBC understands
> only ordinary rows and columns; not the JSON types. Would you use Drill to
> convert the JSON structures into simple rows?
>
> What kinds of transforms (functions) would be needed to handle 2D or
> higher arrays? To handle heterogeneous arrays? To handle multiple list
> columns within the same record? Some time back you talked about a
> correlated flatten in which two arrays can be flattened side-by-side. Any
> other use cases?
>
> Or, would the JSON structures be kept intact, and the data, say, exported
> to other JSON files using CTAS?
>
> Thanks,
>
> - Paul
>
> > On Nov 19, 2017, at 1:42 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > I don't see the value with this suggestion. It isn't going to make things
> > much better since the user will be totally stunned when the structure
> > doesn't come through as an array.
> >
> > A bigger issue is the fact that elements of maps aren't marked correctly as
> > missing. That means that if I have these two records:
> >
> > {"a": {"first": 1, "second": 2}, "b": 3}
> > {"a": {"second": 20, "third": 30}, "b": 5}
> >
> > it is nearly impossible for me to determine whether "first" is missing from
> > the second value of "a". This makes Drill impossible to use for lots of
> > variant structure work. The values that Drill provides will have all fields
> > marked as present.
> >
> > (at least, this was true at one time).
> >
> >
> >
> >
> >
> > On Sun, Nov 19, 2017 at 4:33 AM, Paul Rogers <progers@mapr.com> wrote:
> >
> >> Hi Arina,
> >>
> >> The proposal is to represent 2D arrays as a string (using the original,
> >> unparsed JSON.) That is, given this input:
> >>
> >> {a: “fred”, b: [[10, 20, 30], [11, 21, 31]]}
> >>
> >> The parsed columns are:
> >>
> >> a, b
> >> “fred”, “[[10, 20, 30], [11, 21, 31]]”
> >>
> >> Notice that column b is just a string. It is a string of JSON, yes, but
> >> still just a string.
> >>
> >> So, the question about kvgen/flatten does not apply here since we are not
> >> creating a Drill array.
> >>
> >> There is a very interesting discussion to be had about how Drill
> >> does/should handle “non-relational” JSON structures. But, here, the
> >> suggestion is just for one very simple special case.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >>> On Nov 18, 2017, at 7:15 AM, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
> >>>
> >>> In general sounds good.
> >>> If the user applies kvgen / flatten to such 2-D array columns read as
> >>> strings, will they be able to normalize the data into the format they
> >>> want? Or do we need to come up with a new function?
> >>>
> >>> Kind regards
> >>> Arina
> >>>
> >>> On Fri, Nov 17, 2017 at 10:39 PM, Paul Rogers <progers@mapr.com> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I’d like to propose a minor enhancement to the JSON reader to better
> >>>> handle non-relational JSON structures. (See DRILL-5974 [1].)
> >>>>
> >>>> As background, Drill handles simple tuples:
> >>>>
> >>>> {a: 10, b: “fred”}
> >>>>
> >>>> Drill also handles arrays:
> >>>>
> >>>> {name: “fred”, hobbies: [“bowling”, “golf”]}
> >>>>
> >>>> Drill even handles arrays of tuples:
> >>>>
> >>>> {name: “fred”, orders: [
> >>>> {id: 1001, amount: 12.34},
> >>>> {id: 1002, amount: 56.78}]}
> >>>>
> >>>> The above are termed "relational" because there is a straightforward
> >>>> mapping to/from tables into the above JSON structures.
> >>>>
> >>>> Things get interesting with non-relational types, such as 2-D arrays:
> >>>>
> >>>> {id: 4, shape: “square”, points: [[0, 0], [0, 5], [5, 0], [5, 5]]}
> >>>>
> >>>> Drill has two solutions:
> >>>>
> >>>> * Turn on the experimental list and union support.
> >>>> * Enable all-text mode to read all fields as JSON text.
> >>>>
> >>>> Here, I’d like to propose a middle ground:
> >>>>
> >>>> * Read fields with relational types into vectors.
> >>>> * Read non-relational fields using text mode.
> >>>>
> >>>> Thus, the first three examples would all result in the JSON data parsed
> >>>> into Drill vectors. But, the fourth, non-relational example would produce a
> >>>> row that looks like this:
> >>>>
> >>>> id, shape, points
> >>>> 4, “square”, “[[0, 0], [0, 5], [5, 0], [5, 5]]”
> >>>>
> >>>> Although Drill can’t parse the 2-D array, Drill will pass the array along
> >>>> to the client, which can use its favorite JSON parser to parse the array
> >>>> and do something useful (like draw the square in this case.)
> >>>>
> >>>> In particular, the proposal:
> >>>>
> >>>> * Apply this change only to the revised “batch size aware” JSON reader.
> >>>> * Use the above parsing model by default.
> >>>> * Use the experimental list-and-union support if the existing
> >>>> `exec.enable_union_type` system/session option is set.
> >>>>
> >>>> Existing queries should “just work.” In fact, now JSON with non-relational
> >>>> types will work “out-of-the-box” without all-text mode or the experimental
> >>>> types.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> - Paul
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/DRILL-5974
> >>>>
> >>>>
> >>>>
> >>
> >>
>
>
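
For readers following the thread, the middle-ground proposal quoted above
(parse relational fields, keep non-relational fields as JSON text) can be
sketched in plain Python as below. This is not the Drill reader. The helper
names and the rule used to classify a field are assumptions for illustration,
and the sketch re-serializes the nested array with `json.dumps` rather than
preserving the original, unparsed text as the proposal describes.

```python
import json


def is_relational(value):
    """Treat any list that directly contains another list as non-relational.

    Scalars, maps, lists of scalars, and lists of maps count as relational.
    This classification is a simplification for illustration only.
    """
    if isinstance(value, list):
        return not any(isinstance(item, list) for item in value)
    return True


def read_record(line):
    """Parse one JSON record, keeping non-relational fields as JSON text."""
    record = json.loads(line)
    return {
        name: value if is_relational(value) else json.dumps(value)
        for name, value in record.items()
    }


row = read_record(
    '{"id": 4, "shape": "square", "points": [[0, 0], [0, 5], [5, 0], [5, 5]]}'
)
print(row)

# A client can recover the structure with its own JSON parser.
points = json.loads(row["points"])
```

Here "id" and "shape" come through as ordinary values, while "points" arrives
as a string that the client parses itself, which is the division of labor the
proposal describes.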
