hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-105) estimate number of required reducers and other map-reduce parameters automatically
Date Sat, 06 Dec 2008 02:55:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654029#action_12654029 ]

Joydeep Sen Sarma commented on HIVE-105:

currently mapred.reduce.tasks controls the 'default' number of reducers. it is in fact expected
that users _would_ override it (since the default - for example - is set to 1 in standard
hadoop configs - which is useless).

i am just afraid of overloading the semantics of well-understood hadoop variables. for example,
a new hive user who is reasonably experienced with hadoop might, without reading the
documentation, increase mapred.reduce.tasks and expect the reducer count to change - whereas
nothing would happen, since we would still default to 1G/reducer.

so i would argue for a differently named variable (say: hive.exec.maxreducers) at the minimum.
(i wish hadoop had something equivalent, but since hadoop doesn't determine the reducer count
automatically, it would make little sense there.) if we go this route, i would actually say
that we should forbid setting mapred.reduce.tasks (perhaps by keeping a list in HiveConf of
hadoop options that users cannot set because hive ignores them).
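
to make the proposal concrete, here is a minimal sketch of the idea, not hive's actual code.
it assumes the reducer count is inferred from total input size at the 1G/reducer default
mentioned above and then capped by the proposed hive.exec.maxreducers. the function names,
the cap value of 999 and the contents of the ignored-options list are all hypothetical.

```python
# Sketch only: every name and the cap value here are assumptions for illustration.
BYTES_PER_REDUCER = 1 << 30  # the "1G/reducer" default discussed above
MAX_REDUCERS = 999           # hypothetical value for the proposed hive.exec.maxreducers

# Hypothetical list of hadoop options that hive would ignore and therefore reject.
IGNORED_HADOOP_OPTIONS = frozenset({"mapred.reduce.tasks"})


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Infer a reducer count from input size, capped at max_reducers."""
    if input_bytes < 0 or bytes_per_reducer <= 0 or max_reducers < 1:
        raise ValueError("invalid sizing parameters")
    # Integer ceiling division: one reducer per (partial) chunk of input.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Always run at least one reducer, never more than the cap.
    return max(1, min(wanted, max_reducers))


def check_user_setting(name):
    """Reject options that would be silently ignored under this scheme."""
    if name in IGNORED_HADOOP_OPTIONS:
        raise ValueError(f"{name} is ignored by hive and cannot be set")
```

under these assumptions a 10G input would get 10 reducers, and a user trying to set
mapred.reduce.tasks would get an error instead of a setting that quietly does nothing.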

another quick thought - we should try to pick a nearby prime number (or alternatively a number
whose factors are large primes) for the inferred reducer count, based on previously observed
problems with skew.
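
a rough sketch of the prime adjustment follows. it assumes keys are assigned to reducers by
hash modulo the reducer count, so that keys sharing a common factor with the count pile up on
a few reducers. the helper names are hypothetical, and preferring the smaller prime on a tie
is an arbitrary choice.

```python
def is_prime(n):
    """Trial-division primality test; adequate for reducer-sized numbers."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def nearby_prime(n):
    """Return the prime closest to n (the smaller one on a tie).

    Counts below 2 are returned unchanged so a single reducer stays single.
    """
    if n < 2:
        return n
    delta = 0
    while True:
        if is_prime(n - delta):
            return n - delta
        if is_prime(n + delta):
            return n + delta
        delta += 1


# Illustration of the skew: keys that are all multiples of 10.
keys = range(0, 1000, 10)
buckets_with_10 = {k % 10 for k in keys}                # every key lands on reducer 0
buckets_with_11 = {k % nearby_prime(10) for k in keys}  # keys spread over all 11 reducers
```

note that this version may round upward, so if a maximum reducer count is enforced the search
would need to look only downward from that maximum.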

> estimate number of required reducers and other map-reduce parameters automatically
> ----------------------------------------------------------------------------------
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
> currently users have to specify the number of reducers. In a multi-user environment - we
> generally ask users to be prudent in selecting the number of reducers (since they are long
> running and block other users). Also - a large number of reducers produces a large number
> of output files - which puts pressure on namenode resources.
> there are other map-reduce parameters - for example the min split size and the proposed
> use of combinefileinputformat - that are also fairly tricky for the user to determine (since
> they depend on map side selectivity and cluster size). This will become totally critical when
> there is integration with BI tools since there will be no opportunity to optimize job settings
> and there will be a wide variety of jobs.
> This jira calls for automating the selection of such parameters - possibly by a best
> effort at estimating map side selectivity/output size using sampling and determining such
> parameters from there.
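
the sampling idea in the description could look roughly like the sketch below. this is an
illustration under stated assumptions, not a design from the jira: it assumes rows can be
sampled up front, that map_fn (a hypothetical stand-in for the map-side work) returns the
output records for one input row, and that record length is a fair proxy for size.

```python
import random


def estimate_map_output_bytes(rows, map_fn, total_input_bytes,
                              sample_size=1000, seed=0):
    """Estimate total map output size by running map_fn on a sample of rows."""
    rng = random.Random(seed)
    # Use every row when the input is small, otherwise a uniform random sample.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    in_bytes = sum(len(r) for r in sample)
    out_bytes = sum(len(o) for r in sample for o in map_fn(r))
    if in_bytes == 0:
        return 0
    # Map-side selectivity: output bytes produced per input byte.
    selectivity = out_bytes / in_bytes
    return int(total_input_bytes * selectivity)
```

the estimated output size, rather than the raw input size, could then be used to pick the
reducer count and the other parameters.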

