Mailing-List: contact drill-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: drill-user@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com
 designates 74.125.83.47 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFBNzeNghQjvyF_NLt8jK0j159uX60nOpBFJZFJ1gb7jzOB1Pw@mail.gmail.com>
References: 
 <CAFBNzeMQnwdtdgSf00cOkOksgySpr9Pz3zojc8qUKoLFwqwKDg@mail.gmail.com>
 <CAJwFCa0=F+Jf7bT_711quyonpKrPAZjyCZ6uP-d1C20e4vpyMA@mail.gmail.com>
 <CAJwFCa3PHx3mO_r_cz4c8QfttUp_xCTJb3E-z0BQ6f9tqrq2pA@mail.gmail.com>
 <CAFBNzeNghQjvyF_NLt8jK0j159uX60nOpBFJZFJ1gb7jzOB1Pw@mail.gmail.com>
From: Ted Dunning <ted.dunning@gmail.com>
Date: Fri, 26 Oct 2012 18:10:44 -0400
Message-ID: 
 <CAJwFCa0SLzVr0gBc7st+yM5psqC8wYoKfezKzWbLSSc4rLnrgw@mail.gmail.com>
Subject: Re: Nested collections (e.g. JSON arrays) and drill queries
To: drill-user@incubator.apache.org
Content-Type: multipart/alternative; boundary=047d7b603e96235a2704ccfd9908

--047d7b603e96235a2704ccfd9908
Content-Type: text/plain; charset=UTF-8

The physical plan spec as it stands also includes an IMPLODE.  The expected
idiom there would be EXPLODE, FILTER, IMPLODE.  This will retain the
original structure, however.
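
As a rough illustration of that idiom (plain Python stand-ins, not Drill
operators; the function names, the (index, element) row shape and the y >= 2
predicate are all assumptions made for this sketch), using the two records
from the example quoted below:

```python
from collections import defaultdict

# The two sample records from the quoted example.
records = [
    {"x": [{"y": 1}, {"y": 2}, {"y": 3}]},
    {"x": [{"y": 2}, {"y": 4}]},
]

def explode(records, field):
    """Yield (parent_index, element) pairs, one per nested element."""
    for i, rec in enumerate(records):
        for elem in rec[field]:
            yield i, elem

def implode(records, field, rows):
    """Glue surviving elements back into copies of their parent records."""
    grouped = defaultdict(list)
    for i, elem in rows:
        grouped[i].append(elem)
    return [{**rec, field: grouped[i]} for i, rec in enumerate(records)]

# EXPLODE, FILTER (hypothetical predicate: keep y >= 2), IMPLODE.
rows = explode(records, "x")
kept = ((i, elem) for i, elem in rows if elem["y"] >= 2)
result = implode(records, "x", kept)
print(result)
```

The output still has one record per input record, with only the nested
collection trimmed, which is the "retains the original structure" part.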

I think that what you are aiming at is similar to FLATTEN:

https://developers.google.com/bigquery/docs/query-reference#flatten
https://developers.google.com/bigquery/docs/data#flatten

I haven't addressed this yet in the physical plan spec, but it should be
pretty easily done.  The sequence would be something like
EXPLODE/FILTER/FLATTEN to get the result you want and FLATTEN would be
similar to IMPLODE except that it would not glue the exploded field back
together.
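
A minimal sketch of how that might look, again in plain Python rather than
anything from the physical plan spec (explode, flatten and the keep-everything
filter are hypothetical names chosen for illustration):

```python
# The two sample records from the quoted example.
records = [
    {"x": [{"y": 1}, {"y": 2}, {"y": 3}]},
    {"x": [{"y": 2}, {"y": 4}]},
]

def explode(records, field):
    """Yield (parent_index, element) pairs, one per nested element."""
    for i, rec in enumerate(records):
        for elem in rec[field]:
            yield i, elem

def flatten(rows, key):
    """Emit one output row per exploded element; nothing is glued back."""
    return [elem[key] for _, elem in rows]

# EXPLODE, FILTER (assumed here to keep every element), FLATTEN.
rows = explode(records, "x")
kept = ((i, elem) for i, elem in rows if True)
values = flatten(kept, "y")
print(values)
```

Here the result is the flat list of x.y values (1, 2, 3, 2, 4) rather than
one row per original record.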

(Julian has worried about the default flattening behavior in
Dremel/BigQuery before... I don't know enough to have a strong opinion)

On Fri, Oct 26, 2012 at 5:58 PM, Evan Pollan <evan.pollan@gmail.com> wrote:

> Thanks for the reply, Ted.
>
> What about for the simpler case of treating a nested collection as a
> one-to-many table and leaving the EXPLODE'ed results intact, as if the
> nested collection was JOIN'ed against its containing record?
>
> E.g. being able to select all the x.y values from the following two
> records:
> { x: [ {y: 1}, {y: 2}, {y: 3} ] }
> { x: [ {y: 2}, {y: 4} ] }
>
> - as -
>
> 1
> 2
> 3
> 2
> 4
>
> In other words, does an EXPLODE always have to be followed by an AGGREGATE?
>
> This statement in the BigQuery reference makes it sound like I might be out
> of luck:
>
> The WITHIN keyword specifically works with aggregate functions to aggregate
> > across children and repeated fields within records and nested fields
>
>
>
>
> On Fri, Oct 26, 2012 at 12:47 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > If it is the WITHIN clause that you are interested in, at the physical
> > plan layer this is expressed as EXPLODE/AGGREGATE.  Explode creates a
> > batched data flow which contains values from the nested collection.  The
> > aggregate injects the results back into the original records.
> >
> > How this is implemented at the execution layer is more flexible.  The
> > EXPLODE/AGGREGATE pattern could be recognized and optimized into a loop
> > that explicitly does the aggregation, especially for well-known
> > aggregates.
> >
> > On Fri, Oct 26, 2012 at 12:43 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Does the WITHIN clause help?  In BigQuery, this is described here:
> > >
> > > https://developers.google.com/bigquery/docs/query-reference#within
> > >
> > >
> > > On Thu, Oct 25, 2012 at 2:51 PM, Evan Pollan <evan.pollan@gmail.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> I attended Tomer's Strata/HadoopWorld presentation on Drill yesterday,
> > >> and was very impressed.  Lots of features that map directly to my needs.
> > >>
> > >> He specifically cited support for, on the HDFS side, JSON/BSON, avro,
> > >> and sequence files and emphasized the ability to access nested data.  We
> > >> use JSON heavily, so it sounds like Drill would support base-case queries
> > >> over nested properties within my dataset.  One question I didn't get the
> > >> chance to ask, though:  what about querying over records with nested
> > >> collections?  For example, I have some JSON datasets that look like:
> > >>
> > >> {
> > >>     "propertyA": "valueA",
> > >>     "propertyB": [
> > >>         {
> > >>             "propertyX": "value1",
> > >>             "propertyY": "value2"
> > >>         },
> > >>         {
> > >>             "propertyX": "value3",
> > >>             "propertyY": "value4"
> > >>         }
> > >>     ]
> > >> }
> > >>
> > >> In this case, I have users that would like to be able to access
> > >> propertyB.propertyX and leverage it in joins and aggregations.  Since
> > >> each record has N propertyB.propertyX values, though, I'm wondering how
> > >> Drill's query planner and execution engine would handle this?
> > >>
> > >> thanks,
> > >> Evan
> > >>
> > >
> > >
> >
>

--047d7b603e96235a2704ccfd9908--