hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-105) estimate number of required reducers and other map-reduce parameters automatically
Date Tue, 20 Jan 2009 19:37:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665513#action_12665513 ]

Zheng Shao commented on HIVE-105:

Some detailed plan:

Part A: Automatically detect the number of reducers:
1. If "mapred.reduce.tasks" is set to a non-negative value, use that as the number of reducers
and skip all following steps;
2. Divide the total size of the input files by "hive.exec.bytes.per.reducer" to get a
number R;
3. If "hive.exec.max.reducers" is set to a non-negative value, use min(R, "hive.exec.max.reducers")
as the number of reducers; otherwise use R.

NOTE: To take advantage of this new feature, the user needs to set "mapred.reduce.tasks"
to a negative number. This preserves backward compatibility: hadoop-default.xml sets "mapred.reduce.tasks",
so there is no way for the user to "unset" that variable.
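The three steps of Part A can be sketched as follows. This is an illustrative sketch of the proposal, not Hive's actual implementation; the method name and parameter names are hypothetical, chosen to mirror the configuration variables above.

```java
// Hypothetical sketch of the Part A reducer-count logic.
public class ReducerEstimator {
    /**
     * @param mapredReduceTasks value of "mapred.reduce.tasks" (negative means unset)
     * @param totalInputBytes   total size of the input files
     * @param bytesPerReducer   value of "hive.exec.bytes.per.reducer"
     * @param maxReducers       value of "hive.exec.max.reducers" (negative means unset)
     */
    public static int estimateReducers(int mapredReduceTasks,
                                       long totalInputBytes,
                                       long bytesPerReducer,
                                       int maxReducers) {
        // Step 1: an explicit non-negative setting wins outright.
        if (mapredReduceTasks >= 0) {
            return mapredReduceTasks;
        }
        // Step 2: size-based estimate R, rounded up, at least one reducer.
        int r = (int) ((totalInputBytes + bytesPerReducer - 1) / bytesPerReducer);
        r = Math.max(r, 1);
        // Step 3: cap R by "hive.exec.max.reducers" when that limit is set.
        if (maxReducers >= 0) {
            r = Math.min(r, maxReducers);
        }
        return r;
    }

    public static void main(String[] args) {
        // 10 GB of input at 1 GB per reducer, capped at 8 reducers.
        System.out.println(estimateReducers(-1, 10L << 30, 1L << 30, 8));
    }
}
```

For example, with "mapred.reduce.tasks" = -1, 10 GB of input, 1 GB per reducer, and a cap of 8, the estimate is min(10, 8) = 8 reducers.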

Part B: Allow users to override per-stage reducer numbers:
Add step 0 in Part A:
0. If "hive.exec.reducers.stage.N" (where "N" is the stage number) is set, use that as
the number of reducers.
Part B will also add the functionality to remove all parameters of the form "hive.exec.reducers.stage.N"
after the query is done.
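Part B's step 0 and its cleanup could look roughly like the sketch below. The class and method names are hypothetical, and the Map stands in for the session configuration; only the property name "hive.exec.reducers.stage.N" comes from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Part B per-stage override and cleanup.
public class StageReducerOverride {
    // Stands in for the session configuration.
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) {
        conf.put(key, value);
    }

    /** Step 0: return the stage-specific override, or -1 if unset. */
    public int stageOverride(int stageNumber) {
        String v = conf.get("hive.exec.reducers.stage." + stageNumber);
        return (v == null) ? -1 : Integer.parseInt(v);
    }

    /** After the query is done: remove every per-stage override. */
    public void clearStageOverrides() {
        conf.keySet().removeIf(k -> k.startsWith("hive.exec.reducers.stage."));
    }
}
```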

> estimate number of required reducers and other map-reduce parameters automatically
> ----------------------------------------------------------------------------------
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
> Currently users have to specify the number of reducers. In a multi-user environment we
> generally ask users to be prudent in selecting the number of reducers (since reducers are
> long-running and block other users). Also, a large number of reducers produces a large
> number of output files, which puts pressure on namenode resources.
> There are other map-reduce parameters - for example the min split size and the proposed
> use of combinefileinputformat - that are also fairly tricky for the user to determine (since
> they depend on map-side selectivity and cluster size). This will become critical when
> there is integration with BI tools, since there will be no opportunity to optimize job
> settings and there will be a wide variety of jobs.
> This jira calls for automating the selection of such parameters - possibly by a best-effort
> attempt at estimating map-side selectivity/output size using sampling and determining such
> parameters from there.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
