hadoop-hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-105) estimate number of required reducers and other map-reduce parameters automatically
Date Fri, 12 Dec 2008 06:42:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655904#action_12655904 ]

Adam Kramer commented on HIVE-105:

It seems to me that n00b Hive users (like I was, and maybe still am) would assume (as I did,
until reading this thread) that mapred.reduce.tasks controls the ACTUAL number of reducers
(not a default, or a min, or a max). That's always been the case when I've set it, at least.

On the other hand, I can't think of a use case where a user would want to set the min or the
max when they wouldn't be willing to just set the number they wanted. So perhaps it is sufficient
to make the default "let Hive guess" and to use exactly the number of reducers specified
whenever mapred.reduce.tasks is set at all? Or to use a separate variable for this purpose.
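
A minimal sketch of the rule suggested above, assuming the job configuration can be read as a plain string map. The names `resolve_reducer_count` and `guess` are hypothetical and used only for illustration; they are not Hive APIs.

```python
from typing import Callable, Mapping, Optional


def resolve_reducer_count(conf: Mapping[str, str],
                          guess: Callable[[], int]) -> int:
    """Pick the reducer count: exact if the user set it, estimated otherwise.

    `conf` stands in for the job configuration and `guess` for whatever
    estimator is eventually used; both are assumptions of this sketch.
    """
    explicit: Optional[str] = conf.get("mapred.reduce.tasks")
    if explicit is not None:
        # The user set the property: treat it as the exact count,
        # not as a default, a minimum, or a maximum.
        return int(explicit)
    # Property unset: fall back to the estimate, but never below one reducer.
    return max(1, guess())


if __name__ == "__main__":
    print(resolve_reducer_count({"mapred.reduce.tasks": "7"}, lambda: 32))
    print(resolve_reducer_count({}, lambda: 32))
```

Under this rule the explicit setting always wins, so a separate min/max variable would only be needed if someone finds a use case for bounding the guess.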

Also, 1 reduce task is not entirely useless...it's useful for calculating a variance until
HIVE-165 comes to pass. =)
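
To illustrate why a single reducer helps here: with one reduce task every value reaches the same process, so a variance can be computed in one pass with no merging of partial results. The sketch below assumes the values arrive as a stream of numbers; `running_variance` is a hypothetical name, and Welford's algorithm is chosen for illustration since the thread does not specify a method.

```python
from typing import Iterable, Tuple


def running_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Single-pass (Welford) computation of count, mean, population variance."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance


if __name__ == "__main__":
    print(running_variance([2, 4, 4, 4, 5, 5, 7, 9]))
```

With several reducers, each would hold only a partial (count, mean, m2) triple, and an extra merge step would be needed to combine them.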

> estimate number of required reducers and other map-reduce parameters automatically
> ----------------------------------------------------------------------------------
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
> Currently users have to specify the number of reducers. In a multi-user environment we
> generally ask users to be prudent in selecting the number of reducers (since they are long-running
> and block other users). Also, a large number of reducers produces a large number of output files,
> which puts pressure on namenode resources.
> There are other map-reduce parameters, for example the min split size and the proposed
> use of CombineFileInputFormat, that are also fairly tricky for the user to determine (since
> they depend on map-side selectivity and cluster size). This will become totally critical when
> there is integration with BI tools, since there will be no opportunity to optimize job settings
> and there will be a wide variety of jobs.
> This jira calls for automating the selection of such parameters, possibly by a best
> effort at estimating map-side selectivity/output size using sampling and determining such
> parameters from there.
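
The sampling idea in the description above could be sketched roughly as follows. This is an illustration under stated assumptions, not Hive's implementation: `estimate_reducers`, `map_fn`, `bytes_per_reducer`, and `max_reducers` are hypothetical names, and the default values are placeholders rather than recommended settings.

```python
import math
from typing import Callable, Sequence


def estimate_reducers(sample_rows: Sequence[str],
                      map_fn: Callable[[str], Sequence[str]],
                      total_input_bytes: int,
                      bytes_per_reducer: int = 1_000_000_000,
                      max_reducers: int = 999) -> int:
    """Estimate a reducer count from a sample of the map input.

    Runs the (hypothetical) map function over the sample, measures
    selectivity as output bytes / input bytes, scales that to the full
    input size, and divides by a target amount of data per reducer.
    """
    in_bytes = sum(len(r.encode("utf-8")) for r in sample_rows)
    if in_bytes == 0:
        return 1  # nothing to measure; fall back to a single reducer
    out_bytes = sum(len(o.encode("utf-8"))
                    for r in sample_rows for o in map_fn(r))
    selectivity = out_bytes / in_bytes
    estimated_map_output = selectivity * total_input_bytes
    wanted = math.ceil(estimated_map_output / bytes_per_reducer)
    # Clamp so a bad estimate cannot flood the cluster or yield zero reducers.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    rows = ["aaaa", "bbbb", "cccc", "dddd"]
    keep_a_or_b = lambda r: [r] if r[0] in "ab" else []
    print(estimate_reducers(rows, keep_a_or_b, 10_000, bytes_per_reducer=1_000))
```

The upper clamp reflects the concern raised in the description that many reducers block other users and produce many output files.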

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
