drill-user mailing list archives

From Evan Pollan <evan.pol...@gmail.com>
Subject Re: Nested collections (e.g. JSON arrays) and drill queries
Date Fri, 26 Oct 2012 23:23:46 GMT
Excellent. Thanks for the prompt feedback!


On Oct 26, 2012, at 5:10 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The physical plan spec as it stands also includes an IMPLODE.  The expected
> idiom there would be EXPLODE, FILTER, IMPLODE.  This will retain the
> original structure, however.
> 
> I think that what you are aiming at is similar to FLATTEN:
> 
> https://developers.google.com/bigquery/docs/query-reference#flatten
> https://developers.google.com/bigquery/docs/data#flatten
> 
> I haven't addressed this yet in the physical plan spec, but it should be
> pretty easily done.  The sequence would be something like
> EXPLODE/FILTER/FLATTEN to get the result you want and FLATTEN would be
> similar to IMPLODE except that it would not glue the exploded field back
> together.
> 
> (Julian has worried about the default flattening behavior in
> Dremel/BigQuery before... I don't know enough to have a strong opinion)
> 
> On Fri, Oct 26, 2012 at 5:58 PM, Evan Pollan <evan.pollan@gmail.com> wrote:
> 
>> Thanks for the reply, Ted.
>> 
>> What about for the simpler case of treating a nested collection as a
>> one-to-many table and leaving the EXPLODE'ed results intact, as if the
>> nested collection was JOIN'ed against its containing record?
>> 
>> E.g. being able to select all the x.y values from the following two
>> records:
>> { x: [ {y: 1}, {y: 2}, {y: 3} ] }
>> { x: [ {y: 2}, {y: 4} ] }
>> 
>> - as -
>> 
>> 1
>> 2
>> 3
>> 2
>> 4
>> 
>> In other words, does an EXPLODE always have to be followed by an AGGREGATE?
>> 
>> This statement in the BigQuery reference makes it sound like I might be out
>> of luck:
>> 
>> "The WITHIN keyword specifically works with aggregate functions to aggregate
>> across children and repeated fields within records and nested fields"
>> 
>> 
>> 
>> 
>> On Fri, Oct 26, 2012 at 12:47 AM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> 
>>> If it is the WITHIN clause that you are interested in, at the physical plan
>>> layer, this is expressed as EXPLODE/AGGREGATE.  Explode creates a batched
>>> data flow which contains values from the nested collection.  The aggregate
>>> injects the results back into the original records.
>>> 
>>> How this is implemented at the execution layer is more flexible.  The
>>> EXPLODE/AGGREGATE pattern could be recognized and optimized into a loop
>>> that explicitly does the aggregation, especially for well-known aggregates.
>>> 
>>> On Fri, Oct 26, 2012 at 12:43 AM, Ted Dunning <ted.dunning@gmail.com>
>>> wrote:
>>> 
>>>> Does the WITHIN clause help?  In BigQuery, this is described here:
>>>> 
>>>> https://developers.google.com/bigquery/docs/query-reference#within
>>>> 
>>>> 
>>>> On Thu, Oct 25, 2012 at 2:51 PM, Evan Pollan <evan.pollan@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I attended Tomer's Strata/HadoopWorld presentation on Drill yesterday, and
>>>>> was very impressed.  Lots of features that map directly to my needs.
>>>>> 
>>>>> He specifically cited support for, on the HDFS side, JSON/BSON, Avro, and
>>>>> sequence files and emphasized the ability to access nested data.  We use
>>>>> JSON heavily, so it sounds like Drill would support base-case queries over
>>>>> nested properties within my dataset.  One question I didn't get the chance
>>>>> to ask, though:  what about querying over records with nested collections?
>>>>> For example, I have some JSON datasets that look like:
>>>>> 
>>>>> {
>>>>>    "propertyA": "valueA",
>>>>>    "propertyB": [
>>>>>        {
>>>>>            "propertyX": "value1",
>>>>>            "propertyY": "value2"
>>>>>        },
>>>>>        {
>>>>>            "propertyX": "value3",
>>>>>            "propertyY": "value4"
>>>>>        }
>>>>>    ]
>>>>> }
>>>>> 
>>>>> In this case, I have users that would like to be able to access
>>>>> propertyB.propertyX and leverage it in joins and aggregations.  Since each
>>>>> record has N propertyB.propertyX values, though, I'm wondering how Drill's
>>>>> query planner and execution engine would handle this?
>>>>> 
>>>>> thanks,
>>>>> Evan
>>>>> 
>>>> 
>>>> 
>>> 
>> 
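For readers following the thread, the three operator sequences discussed above (EXPLODE/AGGREGATE, EXPLODE/FILTER/IMPLODE and EXPLODE/FILTER/FLATTEN) can be sketched in a few lines of Python. This is only an illustration of the semantics as described in the messages, using Evan's two example records. The function names, the `max_y` output key and the `y >= 2` filter are assumptions made for the example, and none of this reflects Drill's actual physical plan spec or implementation.

```python
# Hypothetical helpers that model the operator semantics discussed in the
# thread. They are not Drill's physical plan operators.

def explode(records, field):
    """Yield (parent_index, element) for every element of a nested collection."""
    for index, record in enumerate(records):
        for element in record.get(field, []):
            yield index, element


def flatten(exploded):
    """Keep the exploded rows as they are, dropping the link to the parent record."""
    return [element for _, element in exploded]


def implode(records, field, exploded):
    """Glue the surviving elements back into copies of their parent records."""
    rebuilt = [{**record, field: []} for record in records]
    for index, element in exploded:
        rebuilt[index][field].append(element)
    return rebuilt


def aggregate(records, exploded, name, fn):
    """Inject one aggregated value per parent record under the key `name`."""
    groups = [[] for _ in records]
    for index, element in exploded:
        groups[index].append(element)
    return [{**record, name: fn(group)} for record, group in zip(records, groups)]


# The two records from Evan's example.
records = [
    {"x": [{"y": 1}, {"y": 2}, {"y": 3}]},
    {"x": [{"y": 2}, {"y": 4}]},
]

# EXPLODE then FLATTEN: one output row per nested element.
flat_ys = [row["y"] for row in flatten(explode(records, "x"))]

# EXPLODE/FILTER/IMPLODE: filter the elements, keep the original record shape.
# The predicate y >= 2 is an arbitrary choice for illustration.
kept = ((i, e) for i, e in explode(records, "x") if e["y"] >= 2)
imploded = implode(records, "x", kept)

# EXPLODE/AGGREGATE: one value per record, injected back into that record.
# max() assumes every record has at least one nested element.
with_max = aggregate(records, explode(records, "x"), "max_y",
                     lambda group: max(e["y"] for e in group))

print(flat_ys)   # [1, 2, 3, 2, 4]
```

In this sketch the only difference between `flatten` and `implode` is whether the parent index is used to rebuild the containing records, which mirrors Ted's remark that FLATTEN would be similar to IMPLODE except that it would not glue the exploded field back together.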
