hive-dev mailing list archives

From "Adam Kramer (JIRA)" <>
Subject [jira] [Commented] (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper
Date Fri, 09 Jan 2015 01:56:35 GMT


Adam Kramer commented on HIVE-836:

Oh hey there five year old task.

Workaround: Use CLUSTER BY to force a reduce phase, and a staging table to force a map phase.
Hive writes all the data to disk in every phase anyway, so the staging table isn't actually
a performance hit.
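
A minimal sketch of that workaround, reusing the tables and script from the issue description
quoted below. The staging table name tmp_joined is hypothetical, and where the planner actually
places the script may depend on Hive version and settings:

{code:title=workaround.sql}
-- Staging table (hypothetical name): the join runs as its own job
-- and its output is written out before anything else touches it.
CREATE TABLE tmp_joined AS
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Separate query over the staging table: this starts a new job,
-- so the script is expected to run in that job's map phase.
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined a;

-- To push the script into a reduce phase instead, CLUSTER BY the
-- input in a subquery so the rows are shuffled before the script.
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT val1, val2 FROM tmp_joined CLUSTER BY val1
) a;
{code}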

Also protip: DON'T get distracted by the Hive keywords "MAP" and "REDUCE", they are just synonyms
for TRANSFORM and do not do what anybody expects.
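
To illustrate, the two queries below are meant to be equivalent. This is a sketch that assumes
the tblb table and './niftyscript' from the issue description; the output column names are
made up for the example:

{code:title=map_is_transform.sql}
-- Generic form.
FROM tblb b
SELECT TRANSFORM(b.key, b.val)
USING './niftyscript'
AS out1, out2;

-- Same query spelled with the MAP keyword. MAP here is only another
-- way to write TRANSFORM; it does not pin the script to a map phase.
FROM tblb b
MAP b.key, b.val
USING './niftyscript'
AS out1, out2;
{code}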

> Add syntax to force a new mapreduce job / transform subquery in mapper
> ----------------------------------------------------------------------
>                 Key: HIVE-836
>                 URL:
>             Project: Hive
>          Issue Type: Wish
>            Reporter: Adam Kramer
> Hive currently does a lot of awesome work to figure out when my transformers should be
> used in the mapper and when they should be used in the reducer. However, sometimes I have
> a different plan.
> For example, consider this:
> {code:title=foo.sql}
> SELECT TRANSFORM(a.val1, a.val2)
> USING './niftyscript'
> AS part1, part2, part3
> FROM (
>     SELECT b.val AS val1, c.val AS val2
>     FROM tblb b JOIN tblc c on (b.key=c.key)
> ) a
> {code}
> Here, assume that the join step is very easy and 'niftyscript' is really processor-intensive.
> The ideal format for this is an MR task with few mappers and few reducers, and then
> a second MR task with lots of mappers.
> Currently, there is no way to even require that the outer TRANSFORM statement occur in a separate
> map phase. Implementing a "hint" such as /*+ MAP */, akin to /*+ MAPJOIN(x) */, would be awesome.
> Current workaround is to dump everything to a temporary table and then start over, but
> that is not easy to scale--the subquery structure effectively (and easily) "locks" the
> mid-points so no other job can touch the table.
