hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper
Date Tue, 26 Jul 2011 23:25:09 GMT

     [ https://issues.apache.org/jira/browse/HIVE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kramer updated HIVE-836:
-----------------------------

    Description: 
Hive currently does a lot of awesome work to figure out when my transformers should be used
in the mapper and when they should be used in the reducer. However, sometimes I have a different
plan.

For example, consider this:

{code:title=foo.sql}
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c on (b.key=c.key)
) a
{code}

...now, assume that the join step is very easy and 'niftyscript' is really processor-intensive.
The ideal plan for this is an MR task with few mappers and few reducers for the join, and then
a second MR task with lots of mappers for the transform.

Currently, there is no way to require that the outer TRANSFORM statement run in a separate
map phase. Implementing a "hint" such as /*+ MAP */, akin to /*+ MAPJOIN(x) */, would be awesome.

The current workaround is to dump everything to a temporary table and then start over, but that
is not easy to scale: unlike a temporary table, the subquery structure effectively (and easily)
"locks" the mid-points so no other job can touch them.
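
For reference, the temporary-table workaround might look roughly like the sketch below. The
table name tmp_joined is an assumption made up for illustration, and how many mappers the
second query gets still depends on the input split settings.

{code:title=workaround.sql}
-- Job 1: materialize the cheap join into an intermediate table.
-- tmp_joined is a hypothetical name used only for this sketch.
CREATE TABLE tmp_joined AS
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Job 2: a separate query with no join or aggregation, so the
-- TRANSFORM script runs in the mappers of its own job.
SELECT TRANSFORM(t.val1, t.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined t;

-- Clean up. Until this runs, other jobs can see and modify tmp_joined.
DROP TABLE tmp_joined;
{code}

The intermediate table is exposed to every other job between the first statement and the
last, which is the exposure the subquery form avoids.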

  was:
Hive currently does a lot of awesome work to figure out when my transformers should be used
in the mapper and when they should be used in the reducer. However, sometimes I have a different
plan.

For example, consider this:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c on (b.key=c.key)
) a

...in this syntax b and c will be joined (in the reducer, of course), and then the rows that
pass the join clause will be passed to niftyscript _in the reducer._ However, when niftyscript
is high-computation and there is a lot of data coming out of the join but very few reducers,
there's a huge hold-up. It would be awesome if I could somehow force a new mapreduce step
after the subquery, so that ./niftyscript is run in the mappers rather than the prior step's
reducers.

The current workaround is to dump everything to a temporary table and then start over, but that
is not easy to scale: unlike a temporary table, the subquery structure effectively (and easily)
"locks" the mid-points so no other job can touch them.

SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf. https://issues.apache.org/jira/browse/HIVE-835),
or add a query element to specify that "the job ends here." For example, in the above query,
FROM a SELF-CONTAINED or PRECOMPUTE a or START JOB AFTER a or something like that.



> Add syntax to force a new mapreduce job / transform subquery in mapper
> ----------------------------------------------------------------------
>
>                 Key: HIVE-836
>                 URL: https://issues.apache.org/jira/browse/HIVE-836
>             Project: Hive
>          Issue Type: Wish
>            Reporter: Adam Kramer
