hive-dev mailing list archives

From "Adam Kramer (JIRA)" <>
Subject [jira] Created: (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper
Date Wed, 16 Sep 2009 17:35:57 GMT
Add syntax to force a new mapreduce job / transform subquery in mapper

                 Key: HIVE-836
             Project: Hadoop Hive
          Issue Type: Wish
            Reporter: Adam Kramer

Hive currently does a lot of awesome work to figure out when my transformers should be used
in the mapper and when they should be used in the reducer. However, sometimes I have a different

For example, consider this:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c ON (b.key = c.key)
) a

With this syntax, b and c will be joined (in the reducer, of course), and then the rows that
pass the join clause will be passed to niftyscript _in the reducer._ However, when niftyscript
is high-computation and there is a lot of data coming out of the join but very few reducers,
there's a huge hold-up. It would be awesome if I could somehow force a new mapreduce step
after the subquery, so that ./niftyscript is run in the mappers rather than the prior step's
reducers.

Current workaround is to dump everything to a temporary table and then start over, but that
is not easy to scale; the subquery structure effectively (and easily) "locks" the mid-points
so no other job can touch the table.
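
For reference, the temporary-table workaround might look roughly like the sketch below. The
table name tmp_joined and the alias t are hypothetical, and the sketch assumes a Hive version
that supports CREATE TABLE ... AS SELECT; on a version without it, the table would have to be
created first and filled with INSERT OVERWRITE TABLE.

```sql
-- Step 1: materialize the join result in its own job.
-- tmp_joined is a hypothetical table name used only for illustration.
CREATE TABLE tmp_joined AS
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Step 2: run the script in a separate query over the materialized rows.
-- With no join or aggregation in this query, the transform is expected to
-- run in the mappers of the new job.
SELECT TRANSFORM(t.val1, t.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined t;

-- Step 3: clean up the intermediate table.
DROP TABLE tmp_joined;
```

Between steps 1 and 3 the intermediate table is visible to other jobs, which is the exposure
the subquery form avoids.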

SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (c.f.
), or add a query element to specify that "the job ends here." For example, in the above query,
FROM a SELF-CONTAINED or PRECOMPUTE a or START JOB AFTER a or something like that.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
