hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-1251) TRANSFORM should allow piping or allow cross-subquery assumptions.
Date Tue, 26 Jul 2011 23:19:11 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kramer updated HIVE-1251:
------------------------------

    Description: 
Many traditional transforms can be accomplished via simple unix commands chained together.
For example, the "sort" phase is an instance of "cut -f 1 | sort". However, the TRANSFORM
command in Hive doesn't allow for unix-style piping to occur.

One classic case where I wish there were piping is when I want to "stack" a column into several
rows:

SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python reducer.py' AS key, value

...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then the reducer would reduce the above down to one item per key. In this case, the
current workaround is this:

SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
    (SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, col FROM table) a

...the problem here is that for the above to work (and it should, indeed, work in a map-only
MR task), I must assume that the data output from one subquery will be passed in EXACTLY THE
SAME FORMAT to the outer query--i.e., I must assume that Hive will not cut a map or reduce
phase in between, or "fan out" data from the inner query into different mappers in the outer
query.
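
To make the ordering assumption concrete, here is a minimal Python sketch of what stacker.py and reducer.py might look like. Neither script is shown in this report, so the function names stack and reduce_rows, the tab-separated row format, and the choice to join values with commas are all assumptions made for illustration. The reducer only groups adjacent rows, as a streaming script typically would, so it emits one row per key only when rows for the same key arrive contiguously.

```python
import itertools


def stack(lines):
    """Hypothetical stacker.py logic: turn 'key<TAB>col0<TAB>col1<TAB>col2'
    into one 'key<TAB>col' row per column."""
    for line in lines:
        key, *cols = line.rstrip("\n").split("\t")
        for col in cols:
            yield f"{key}\t{col}"


def reduce_rows(lines):
    """Hypothetical reducer.py logic: collapse each run of adjacent rows
    that share a key into a single 'key<TAB>v1,v2,...' row."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{','.join(value for _, value in group)}"


rows = ["a\t1\t2\t3", "b\t4\t5\t6"]

# Piped directly, rows for each key stay adjacent: one output row per key.
piped = list(reduce_rows(stack(rows)))

# If rows were redistributed or reordered between the two steps (simulated
# here by interleaving the two keys), the same key appears in several rows.
stacked = list(stack(rows))
interleaved = [row for pair in zip(stacked[:3], stacked[3:]) for row in pair]
reordered = list(reduce_rows(interleaved))

print(piped)
print(reordered)
```

The first result has one row per key; the second has one row per input row, because no two adjacent rows share a key once the stream is interleaved.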

As a user, *I should not be allowed to assume* that data coming out of a subquery goes into
the nodes for a superquery in the same order...ESPECIALLY in the map phase.
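
One possible way to get pipe semantics inside a single TRANSFORM, pending any change in Hive itself, is a small wrapper script that chains the commands on its own. This is a sketch under assumptions and not something this report proposes: the function name run_pipeline is hypothetical, the two inline commands merely stand in for stacker.py and reducer.py, and whether such a wrapper suits a given cluster is untested here.

```python
import subprocess
import sys
import tempfile


def run_pipeline(commands, stdin=None, stdout=None):
    """Chain commands like a shell pipe: each command's stdout feeds the
    next command's stdin. Returns the exit code of the last command."""
    procs = []
    upstream = stdin
    for i, cmd in enumerate(commands):
        last = i == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=upstream,
            stdout=stdout if last else subprocess.PIPE,
        )
        # Close our copy of the previous pipe so the upstream process is
        # notified if this one exits early.
        if procs:
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout
    for proc in procs:
        proc.wait()
    return procs[-1].returncode


# Demo: two inline commands stand in for stacker.py and reducer.py.
emit = [sys.executable, "-c", "print('b'); print('a')"]
sort = [sys.executable, "-c",
        "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

with tempfile.TemporaryFile() as out:
    code = run_pipeline([emit, sort], stdin=subprocess.DEVNULL, stdout=out)
    out.seek(0)
    demo_output = out.read().decode()

print(code, demo_output.split())
```

With stdin and stdout left at their defaults, the wrapper reads rows from its own standard input and writes to its own standard output, which is the stream interface TRANSFORM scripts use. Because both commands then run in one process tree on one node, no task boundary can fall between them.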

  was:
Many traditional transforms can be accomplished via simple unix commands chained together.
For example, the "sort" phase is an instance of "cut -f 1 | sort". However, the TRANSFORM
command in Hive doesn't allow for unix-style piping to occur.

One classic case where I wish there was piping is when I want to "stack" a column into several
rows:

SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python reducer.py' AS key,
value

...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then the reducer would reduce the above down to one item per key. In this case, the
current workaround is this:

SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
    (SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, col FROM table)

...the problem here is that as a user, *I should not be allowed to assume* that the output
from the inner query will be passed DIRECTLY to the outer query (i.e., the outer query should
not assume that it gets the inner query's output on the same box and in the same order). I
know as a programmer that this works fine as a pipe, but when writing Hive code I always wonder--what
if Hive decides to run the inner query in a reduce step, and the outer query in a subsequent
map step?

Broadly, my understanding is that the goal of Hive is to abstract the mapreduce process away
from users. To this end, we have syntax (CLUSTER BY) that allows users to assume that a reduce
task will occur (but see also https://issues.apache.org/jira/browse/HIVE-835 ), but there
is no formal way to force or syntactically assume that the data will NOT be copied or sorted
or transformed. I argue that the only case where this would be necessary or desirable would
be in the instance of a pipe within a transform...ergo a desire for | to work as expected.

An alternative would be for the HQL language definition to explicitly state all conditions
that would cause a task boundary to be crossed (so I can make the strong assumption that if
none of those conditions obtains, my query will be supported in the future)...but that seems
potentially restrictive as the language and Hadoop evolve.


        Summary: TRANSFORM should allow piping or allow cross-subquery assumptions.  (was: TRANSFORM should allow pipes in some form)

> TRANSFORM should allow piping or allow cross-subquery assumptions.
> ------------------------------------------------------------------
>
>                 Key: HIVE-1251
>                 URL: https://issues.apache.org/jira/browse/HIVE-1251
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Adam Kramer

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
