tez-issues mailing list archives

From "Achal Soni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-338) Determine reduce task parallelism
Date Mon, 05 Aug 2013 20:00:56 GMT

    [ https://issues.apache.org/jira/browse/TEZ-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729877#comment-13729877 ]

Achal Soni commented on TEZ-338:

I think this is pretty interesting, and looks like a good approach from my end. 

I was discussing some possibilities along these lines with Hitesh for the Weighted Range Partitioner.
Essentially, each task would maintain and aggregate a histogram of its keys and the
associated output data distribution. There would also be "user code" (possibly in
the form of a virtual vertex) that consolidates the histograms from all tasks to produce
an overall view of the data distribution and determines which ranges are sent to which reducer.
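
To make the idea concrete, here is a minimal sketch of per-task histograms and the
consolidation step. It is plain Python, not Tez code: the function names, the use of byte
counts as the weight, and the `bucket_of` mapping are all assumptions made for illustration.

```python
from collections import Counter

def task_histogram(records, bucket_of):
    """Build one task's histogram: output bytes seen per key bucket.

    `records` is an iterable of (key, value_bytes) pairs and `bucket_of`
    maps a key to a bucket index. Both are hypothetical stand-ins.
    """
    hist = Counter()
    for key, value in records:
        hist[bucket_of(key)] += len(value)
    return hist

def consolidate(histograms):
    """Merge the per-task histograms into one overall distribution."""
    total = Counter()
    for hist in histograms:
        total.update(hist)  # Counter.update adds counts instead of replacing them
    return total

# Hypothetical bucketing: integer keys, ten keys per bucket.
bucket_of = lambda key: key // 10

task_a = task_histogram([(3, b"xx"), (15, b"yyyy")], bucket_of)
task_b = task_histogram([(7, b"z"), (42, b"wwwww")], bucket_of)
overall = consolidate([task_a, task_b])
```

The consolidated histogram is what the scheduler-side user code would inspect when
deciding which ranges go to which reducer.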

The partitioner supplied to the processor can initially bucket the keys into some configurable
number of ranges, say 100. Then, after the user code has run on the scheduler side, it can
either repartition the buckets (which could be easy, as it's essentially slicing and dicing
the different buckets) or, if the bucket sizes are small enough, make each reducer responsible
for a certain range of buckets.
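
A rough sketch of the second option, where each reducer takes a contiguous range of the
initial buckets, might look like the following. This is plain Python rather than Tez code;
`make_bucketer`, `assign_buckets`, the boundary values, and the greedy equal-weight split are
assumptions chosen for illustration, not a worked-out design.

```python
import bisect

def make_bucketer(boundaries):
    """Return a function mapping a key to a bucket index.

    `boundaries` is a sorted list of split points; n boundaries give n + 1 buckets.
    """
    def bucket_of(key):
        return bisect.bisect_right(boundaries, key)
    return bucket_of

def assign_buckets(weights, num_reducers):
    """Split buckets 0..n-1 into contiguous ranges of roughly equal weight.

    `weights[i]` is the consolidated weight of bucket i. Returns a list of
    inclusive (first_bucket, last_bucket) ranges, one per reducer. With heavy
    skew or few buckets it may return fewer ranges than `num_reducers`.
    """
    total = sum(weights)
    ranges, start, running = [], 0, 0
    for i, weight in enumerate(weights):
        running += weight
        # Close the current range once the cumulative weight reaches this
        # reducer's share; the last reducer takes whatever remains.
        if (len(ranges) < num_reducers - 1
                and running * num_reducers >= total * (len(ranges) + 1)):
            ranges.append((start, i))
            start = i + 1
    if start < len(weights):
        ranges.append((start, len(weights) - 1))
    return ranges

bucket_of = make_bucketer([10, 20, 30])              # 4 buckets for illustration
ranges = assign_buckets([5, 1, 1, 1, 4, 4], num_reducers=3)
```

Because the ranges are contiguous, the key ordering across reducers is preserved, which is
what a range partitioner needs.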

Certain details of course have to be worked out, but I think it would be awesome if you could
keep this proposal in mind as you start to develop the framework for the reduce task parallelism,
because I think the needs of both features are very similar. 
> Determine reduce task parallelism
> ---------------------------------
>                 Key: TEZ-338
>                 URL: https://issues.apache.org/jira/browse/TEZ-338
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>              Labels: TEZ-0.2.0
> Determine the parallelism of reduce tasks at runtime. This is important because it's difficult
> to determine this accurately before the job actually runs due to unknown data reduction ratios
> in the intermediate stages.

