Return-Path: X-Original-To: apmail-incubator-drill-user-archive@minotaur.apache.org Delivered-To: apmail-incubator-drill-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A484DDA7D for ; Fri, 26 Oct 2012 22:11:42 +0000 (UTC) Received: (qmail 39498 invoked by uid 500); 26 Oct 2012 22:11:42 -0000 Delivered-To: apmail-incubator-drill-user-archive@incubator.apache.org Received: (qmail 39445 invoked by uid 500); 26 Oct 2012 22:11:42 -0000 Mailing-List: contact drill-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: drill-user@incubator.apache.org Delivered-To: mailing list drill-user@incubator.apache.org Received: (qmail 39436 invoked by uid 99); 26 Oct 2012 22:11:42 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 26 Oct 2012 22:11:42 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com designates 74.125.83.47 as permitted sender) Received: from [74.125.83.47] (HELO mail-ee0-f47.google.com) (74.125.83.47) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 26 Oct 2012 22:11:36 +0000 Received: by mail-ee0-f47.google.com with SMTP id t10so1268024eei.6 for ; Fri, 26 Oct 2012 15:11:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=btUpaaE0gIdRi2APGCQabatuxGVsDgFOIhW1MBi9O5g=; b=ph4GOhqI0C+MquF8Ed6KlFY4wNzO5bTt0RLoS0r1X46Ho4XmThvMUuJxqrHS8Pjdpe uTUjVqT2J7Yd7CiQiWyVzT72HEUISeMXMOuwfKaoshZxR3pBweIZz/NvkJwFGQtlP9Rl dAudnp9XNW53QK1esfZzo91qOc9BvIB/Z6+zDeLYWebmaERSEWY/EhF7k3XJeJZQK6be qhkKcvOzOnFBQpXt76bJAn01yUf1ZT8dc3JmSnV2ehCVRP/wqWU0WhkuasYhJq8J27GB qZUablimjuU3q9h5aIOzmqgMpa8wrogKABgRmeeRJzJKE3g7tRhjuz1uJJ+xI/H6Iuu2 xc2Q== Received: by 10.14.175.71 with SMTP id y47mr36588400eel.36.1351289474865; Fri, 26 Oct 2012 15:11:14 -0700 (PDT) MIME-Version: 1.0 Received: by 10.223.74.89 with HTTP; Fri, 26 Oct 2012 15:10:44 -0700 (PDT) In-Reply-To: References: From: Ted Dunning Date: Fri, 26 Oct 2012 18:10:44 -0400 Message-ID: Subject: Re: Nested collections (e.g. JSON arrays) and drill queries To: drill-user@incubator.apache.org Content-Type: multipart/alternative; boundary=047d7b603e96235a2704ccfd9908 X-Virus-Checked: Checked by ClamAV on apache.org --047d7b603e96235a2704ccfd9908 Content-Type: text/plain; charset=UTF-8 The physical plan spec as it stands also includes an IMPLODE. The expected idiom there would be EXPLODE, FILTER, IMPLODE. This will retain the original structure, however. I think that what you are aiming at is similar to FLATTEN: https://developers.google.com/bigquery/docs/query-reference#flatten https://developers.google.com/bigquery/docs/data#flatten I haven't addressed this yet in the physical plan spec, but it should be pretty easily done. The sequence would be something like EXPLODE/FILTER/FLATTEN to get the result you want and FLATTEN would be similar to IMPLODE except that it would not glue the exploded field back together. (Julian has worried about the default flattening behavior in Dremel/BigQuery before... I don't know enough to have a strong opinion) On Fri, Oct 26, 2012 at 5:58 PM, Evan Pollan wrote: > Thanks for the reply, Ted. > > What about for the simpler case of treating a nested collection as a > one-to-many table and leaving the EXPLODE'ed results intact, as if the > nested collection was JOIN'ed against it's containing record? > > E.g. being able to select all the x.y values from the following two > records: > { x: [ {y: 1}, {y: 2}, {y: 3} ] } > { x: [ {y: 2}, {y: 4} ] } > > - as - > > 1 > 2 > 3 > 2 > 4 > > In other words, does an EXPLODE always have to be followed by an AGGREGATE. > > This statement in the BigQuery reference makes it sound like I might be out > of luck: > > The WITHIN keyword specifically works with aggregate functions to aggregate > > across children and repeated fields within records and nested fields > > > > > On Fri, Oct 26, 2012 at 12:47 AM, Ted Dunning > wrote: > > > It it is the within clause that you are interested in, at the physical > plan > > layer, this is expressed as EXPLODE/AGGREGATE. Explode creates a batched > > data flow which contains values from the nested collection. The > aggregate > > injects the results back into the original records. > > > > How this is implemented at the execution layer is more flexible. The > > EXPLODE/AGGREGATE pattern could be recognized and optimized into a loop > > that explicitly does the aggregation, especially for well-known > aggregates. > > > > On Fri, Oct 26, 2012 at 12:43 AM, Ted Dunning > > wrote: > > > > > Does the WITHIN clause help? In BigQuery, this is described here: > > > > > > https://developers.google.com/bigquery/docs/query-reference#within > > > > > > > > > On Thu, Oct 25, 2012 at 2:51 PM, Evan Pollan > >wrote: > > > > > >> Hi, > > >> > > >> I attended Tomer's Strata/HadoopWorld presentation on Drill yesterday, > > and > > >> was very impressed. Lots of features that map directly to my needs. > > >> > > >> He specifically cited support for, on the HDFS side, JSON/BSON, avro, > > and > > >> sequence files and emphasized the ability to access nested data. We > use > > >> JSON heavily, so it sounds like Drill would support base-case queries > > over > > >> nested properties within my dataset. One question I didn't get the > > chance > > >> to ask, though: what about querying over records with nested > > collections? > > >> For example, I have some JSON datasets that look like: > > >> > > >> { > > >> "propertyA": "valueA", > > >> "propertyB": [ > > >> { > > >> "propertyX": "value1", > > >> "propertyY": "value2" > > >> }, > > >> { > > >> "propertyX": "value3", > > >> "propertyY": "value4" > > >> } > > >> ] > > >> } > > >> > > >> In this case, I have users that would like to be able to access > > >> propertyB.propertyX and leverage it in joins and aggregations. Since > > each > > >> record has N propertyB.propertyX values, though, I'm wondering how > > Drill's > > >> query planner and execution engine would handle this? > > >> > > >> thanks, > > >> Evan > > >> > > > > > > > > > --047d7b603e96235a2704ccfd9908--