hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-2295) Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query level.
Date Wed, 20 Jul 2011 14:04:57 GMT
Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query level.
--------------------------------------------------------------------------------------

                 Key: HIVE-2295
                 URL: https://issues.apache.org/jira/browse/HIVE-2295
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Adam Kramer


The common pattern for using the mapreduce framework looks like this:

SELECT TRANSFORM(a.foo, a.bar)
USING 'mapper.py'
AS x, y, z
FROM (
  SELECT b.foo, b.bar
  FROM tablename b
  CLUSTER BY b.foo
) a;

...however, this is exceptionally fragile, as it relies on the assumption that Hive is not
doing any "magic" in between the query steps. People familiar with SQL frequently assume that
query steps are effectively separated from each other. CLUSTER BY, then, would guarantee that
data are clustered on their way OUT of the query, but really what we need is a directive to
indicate that data must be clustered on the way INTO the query.

This is not pedantic, because there is no reason that Hive wouldn't try to optimize data flow
between query levels, for example, by systematically splitting up big queries. The UDAF framework,
with its merging step, would allow the rows for a single key to be split across SEVERAL reducers,
"violating" the mapreduce assumptions but still returning the correct data. For a TRANSFORM
statement, however, no such protection is afforded.
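
To make the difference concrete, here is a minimal sketch in Python. It is only an illustration and none of it is Hive code: transform_script stands in for a hypothetical reducer.py that sums bar per foo over tab-separated rows, and merge_partials stands in for the kind of merge step a UDAF provides. The names, the summing logic, and the two-column output are assumptions made for the example.

```python
import itertools
from collections import Counter


def transform_script(lines):
    """Hypothetical reducer.py body: sum bar per foo.

    Assumes every row for a given foo arrives contiguously on ONE input stream.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for foo, group in itertools.groupby(rows, key=lambda r: r[0]):
        yield foo, sum(int(r[1]) for r in group)


def merge_partials(partials):
    """UDAF-style merge: combine per-reducer partial sums into one total per key."""
    merged = Counter()
    for partial in partials:
        merged.update(dict(partial))  # adds the partial sums key by key
    return dict(merged)


# Suppose the rows for key "a" were split across two reducers.
reducer_1 = ["a\t1\n", "b\t5\n"]
reducer_2 = ["a\t2\n"]
partials = [list(transform_script(reducer_1)), list(transform_script(reducer_2))]

# A TRANSFORM has no merge step: the outputs are simply concatenated,
# so key "a" shows up twice with partial sums.
transform_output = [row for partial in partials for row in partial]

# An aggregate with a merge step recovers the correct totals.
merged_output = merge_partials(partials)
```

Here transform_output contains two separate rows for "a", while merged_output has a single, correct total for it. The script itself cannot detect or repair the split, which is why it needs a guarantee about its input.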

I propose, for greater clarity, that these directives be part of the same query level. Example
syntax:

SELECT TRANSFORM(foo, bar)
USING 'reducer.py'
AS x, y, z
FROM tablename
CLUSTERED BY foo;

...in other words, move the directive regarding data distribution to the query that actually
cares about it, allowing for users who are making the assumptions of the mapreduce framework
to formally indicate that their transformer really DOES need clustered data. Put differently,
CLUSTER BY is a directive guaranteeing that data are clustered on the way
OUT OF a query (i.e., for bucketed tables), whereas CLUSTERED BY is a directive guaranteeing
that data are clustered on the way INTO a query.

Bonus points: For tables that are already CLUSTERED BY in their definition, allow this query
to run in the map phase.


        
