incubator-drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Flint <bfl...@popcap.com>
Subject Re: Next steps, "hello world" queries, and rants about query languages
Date Wed, 19 Sep 2012 21:17:47 GMT
Julian,
I'm not sure if this will help, but there are several example queries
against a nested dataset here (it's the m-lab dataset):
https://developers.google.com/bigquery/docs/dataset-mlab.  There are also
a few examples in the Experiments section of the Dremel paper, but I'm
sure you've already seen those.

Cheers,
Ben

On 9/19/12 1:33 PM, "Julian Hyde" <julianhyde@gmail.com> wrote:

>On Sep 18, 2012, at 4:09 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> Excellent.
>> 
>> I think that the key next steps are:
>> 
>> 1) codify some API's (Jason and Julian are on this)
>> 
>> 2) build out the "hello world" app using the available parser and Optiq
>>and
>> a trivial data model (#1 is not a blocker for this)
>> 
>> 3) evaluate where we are implications are relative to nested data and
>> columnar formats
>
>It would be really useful to establish a list of queries. Minimal, but
>demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the
>ability to aggregate over nested collections.
>
>> Once we have that, we should be able to open up the parallelism of
>> development massively.
>> 
>> Some open questions/tasks that many people may be able to determine
>>include:
>> 
>> - is it possible to change Optiq operators and optimization rules
>>without a
>> recompile?
>
>Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq
>starts from scratch each time a query is prepared, and takes whatever
>metadata sources, operators and rules it is given.)
>
>> - what has to happen to make Optiq handle nested data?
>
>I am still mulling the answer to that question. The list of queries I
>mentioned above will help me understand what exactly is "nested data" and
>the operations on it.
>
>I have been using 
>http://developers.google.com/bigquery/docs/query-reference as my main
>resource for BigQuery/DrQL. Let me know if there is a more authoritative
>source.
>
>For the record (I'll only make this whine once, I promise) the BigQuery
>language has some usability flaws, particularly inconsistencies with SQL.
>Some of these may have been deliberate (for instance, to try to make the
>language more concise for the user to type), or may have been made out of
>ignorance for SQL.
>
>* The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator
>counts not-NULL values, not boolean TRUE values, so in SQL this
>expression would not have the desired effect
>
>* My jaw dropped when I read "Unlike many other SQL-based systems,
>BigQuery uses the comma syntax to indicate table unions, not joins". I
>see how commas are a convenient abbreviation for UNION, but that's like
>redefining "if" in a programming language.
>
>* "Note that the HAVING clause can only refer to fields defined in your
>SELECT clause (if the field has an alias, you must use it; if it doesn't,
>use the aggregate field name instead)." This kind of design decision
>makes the language difficult to generate for.
>
>* Apparently string literals can be enclosed in either single or double
>quotes. This may be for the convenience of the user, but is at odds with
>SQL, which allows only single quotes. (Many SQL dialects use double
>quotes to quote identifiers that contain mixed case or spaces.)
>
>* BigQuery allows table names to be prefixed with an optional
>"projectname:".
>
>* The syntax for accessing nested collections at depth 2 or more. If "a"
>is a table alias, and b is a collection-valued field, then "a.b" is the
>collection of sub-records, and "a.b.c" is the union of the collections of
>the "c" field of all records in the "a.b" collection. In 'a.b.c', the
>first '.' is the conventional operator that accesses a field of a record.
>But the second '.' is a strange beast that operates on a set of records.
>
>Most of these I offer as evidence that DrQL is not a superset of SQL. I
>suppose we can vive la difference, implement separate parsers/validators
>for the two languages, including implementing DrQL "features" we think
>are ill-advised.
>
>But I was looking to DrQL to find out how to query nested collections,
>and I didn't find much depth. How, for instance, to convert a "dept"
>relation with nested "emp" records into a "flat" relation? Or given an
>"shipment.order.lineitem" nested relation, sum up lineitem.discount only
>in orders that have "order.prepaid=true"? Are nested collections ordered?
>
>SQL's support for nested collections is powerful & consistent. (But not
>very concise, or intuitive for the non-expert.) It has operators like
>UNNEST that convert a nested collection into a relation (note that
>collections and relations are different beasts), and operators such as
>CARDINALITY to aggregate them. (My friend John Sichi wrote a great
>overview here: 
>http://farrago.sourceforge.net/design/CollectionTypes.html.)
>
>To be clear, I am not advocating adding SQL's collection constructs to
>Drill's query language. But I am drawing inspiration from them when
>designing how Optiq would represent collections, because a query planner
>needs semantics to be very, very precise. I can see how most of DrQL's
>constructs would map onto SQL's constructs.
>
>I'll have a better answer to the "what has to happen to make Optiq handle
>nested data" question in a day or two.
>
>Julian


Mime
View raw message