incubator-drill-user mailing list archives

From Evan Pollan <evan.pol...@gmail.com>
Subject Re: Nested collections (e.g. JSON arrays) and drill queries
Date Fri, 26 Oct 2012 21:58:56 GMT
Thanks for the reply, Ted.

What about the simpler case of treating a nested collection as a
one-to-many table and leaving the EXPLODE'ed results intact, as if the
nested collection were JOIN'ed against its containing record?

E.g. being able to select all the x.y values from the following two records:
{ x: [ {y: 1}, {y: 2}, {y: 3} ] }
{ x: [ {y: 2}, {y: 4} ] }

- as -

1
2
3
2
4

In other words, does an EXPLODE always have to be followed by an AGGREGATE?
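
To make the two cases concrete, here is a minimal Python sketch of the
semantics being asked about. It is not Drill or BigQuery syntax, and the
`explode` helper is a hypothetical name used only for illustration. The first
result keeps the exploded values as flat rows; the second re-aggregates them
per containing record, which is roughly what an EXPLODE followed by an
AGGREGATE would give.

```python
from typing import Any, Dict, Iterable, Iterator, List

# The two example records from above, written as valid Python dicts.
records: List[Dict[str, Any]] = [
    {"x": [{"y": 1}, {"y": 2}, {"y": 3}]},
    {"x": [{"y": 2}, {"y": 4}]},
]


def explode(rows: Iterable[Dict[str, Any]], collection: str, field: str) -> Iterator[Any]:
    """Yield one value per element of the nested collection (hypothetical helper)."""
    for row in rows:
        # A record without the collection contributes no rows.
        for child in row.get(collection, []):
            yield child[field]


# Explode only: one output row per nested element, like a one-to-many join.
flattened = list(explode(records, "x", "y"))

# Explode then aggregate: one output value per containing record.
per_record_sums = [sum(explode([row], "x", "y")) for row in records]

print(flattened)
print(per_record_sums)
```

The question is whether the first form (`flattened`) can be produced on its
own, or only the second.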

This statement in the BigQuery reference makes it sound like I might be out
of luck:

> The WITHIN keyword specifically works with aggregate functions to aggregate
> across children and repeated fields within records and nested fields

On Fri, Oct 26, 2012 at 12:47 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> If it is the within clause that you are interested in, at the physical plan
> layer, this is expressed as EXPLODE/AGGREGATE.  Explode creates a batched
> data flow which contains values from the nested collection.  The aggregate
> injects the results back into the original records.
>
> How this is implemented at the execution layer is more flexible.  The
> EXPLODE/AGGREGATE pattern could be recognized and optimized into a loop
> that explicitly does the aggregation, especially for well-known aggregates.
>
> On Fri, Oct 26, 2012 at 12:43 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Does the WITHIN clause help?  In BigQuery, this is described here:
> >
> > https://developers.google.com/bigquery/docs/query-reference#within
> >
> >
> > On Thu, Oct 25, 2012 at 2:51 PM, Evan Pollan <evan.pollan@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I attended Tomer's Strata/HadoopWorld presentation on Drill yesterday, and
> >> was very impressed.  Lots of features that map directly to my needs.
> >>
> >> He specifically cited support for, on the HDFS side, JSON/BSON, Avro, and
> >> sequence files and emphasized the ability to access nested data.  We use
> >> JSON heavily, so it sounds like Drill would support base-case queries over
> >> nested properties within my dataset.  One question I didn't get the chance
> >> to ask, though:  what about querying over records with nested collections?
> >> For example, I have some JSON datasets that look like:
> >>
> >> {
> >>     "propertyA": "valueA",
> >>     "propertyB": [
> >>         {
> >>             "propertyX": "value1",
> >>             "propertyY": "value2"
> >>         },
> >>         {
> >>             "propertyX": "value3",
> >>             "propertyY": "value4"
> >>         }
> >>     ]
> >> }
> >>
> >> In this case, I have users that would like to be able to access
> >> propertyB.propertyX and leverage it in joins and aggregations.  Since each
> >> record has N propertyB.propertyX values, though, I'm wondering how Drill's
> >> query planner and execution engine would handle this?
> >>
> >> thanks,
> >> Evan
> >>
> >
> >
>
