hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-1199) configure total number of mappers
Date Fri, 25 Mar 2011 17:38:05 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011302#comment-13011302 ]

Adam Kramer commented on HIVE-1199:
-----------------------------------

+1. This is also a bigger issue for automation of jobs that require tweaking the amount of
resources. I have a job right now that needs about 10x the number of mappers to run smoothly,
and I would like to pipeline it, but the data size is growing...so if I configure the split
sizes, I need to do so based on today's size of the table. That should be handled by Hive.

Ideally, this would mean that the split.sizes are generated or recomputed dynamically. One
variable, mapred.map.tasks.approx, could be set or unset...then Hive could do some quick math
based on the size of the table and dynamically set its own mapred.max.split.size and
mapred.min.split.size to get approximately the desired number of mappers. It doesn't have
to be perfect in order to be useful!
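
A minimal sketch of that "quick math", assuming the total input size of the table in bytes
is already known. mapred.map.tasks.approx is only the variable proposed above, not an existing
setting, and the function names here are hypothetical. Setting the min and max split sizes to
the same value is one possible choice for illustration, not something Hive is known to do.

```python
def approx_split_size(total_input_bytes, desired_mappers):
    """Return a split size in bytes that yields roughly desired_mappers splits."""
    if desired_mappers <= 0:
        raise ValueError("desired_mappers must be positive")
    if total_input_bytes <= 0:
        # Empty input: any positive split size works, so use the smallest one.
        return 1
    # Integer ceiling division, so the split count does not exceed the target.
    return -(-total_input_bytes // desired_mappers)


def split_settings(total_input_bytes, desired_mappers):
    """Build the split-size settings for an approximate mapper count (assumed policy)."""
    size = approx_split_size(total_input_bytes, desired_mappers)
    return {
        "mapred.max.split.size": size,
        "mapred.min.split.size": size,
    }


if __name__ == "__main__":
    # Example: a 10 GiB table and a target of about 100 mappers.
    for key, value in split_settings(10 * 1024**3, 100).items():
        print(f"set {key}={value};")
```

Because the size is recomputed from the table's current size on every run, a pipelined job
keeps roughly the same number of mappers as the data grows. The actual count can still differ,
for example when a mapper cannot span more than one partition.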

> configure total number of mappers
> ---------------------------------
>
>                 Key: HIVE-1199
>                 URL: https://issues.apache.org/jira/browse/HIVE-1199
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>
> For users, it might be very difficult to control the number of mappers. There are many
> parameters, which confuse users; for CombineHiveInputFormat, a different set of parameters
> is required to control the number of mappers.
> In general, users should have a way to specify the total number of mappers, which should
> be obeyed. This will be very difficult to guarantee, since the query might be reading from
> a large number of partitions, where a mapper can only span one partition.
> What if the number of mappers that the user wants is less than the total number of partitions?
> It would be a very useful heuristic to have. A simple use case that Joy had is as follows:
> A query needs to be run on one table, which has a lot of small files. It will be easier
> for him to specify the total number of mappers rather than the various rack-local/node-local
> CombineFileInputFormat parameters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
