Mailing-List: contact issues-help@tez.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@tez.incubator.apache.org
Date: Mon, 5 Aug 2013 20:00:56 +0000 (UTC)
From: "Achal Soni (JIRA)" <jira@apache.org>
To: issues@tez.incubator.apache.org
Message-ID: <JIRA.12661832.1375690316417.3094.1375732856827@arcas>
In-Reply-To: <JIRA.12661832.1375690316417@arcas>
References: <JIRA.12661832.1375690316417@arcas>
Subject: [jira] [Commented] (TEZ-338) Determine reduce task parallelism
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/TEZ-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729877#comment-13729877 ] 

Achal Soni commented on TEZ-338:
--------------------------------

I think this is pretty interesting, and it looks like a good approach from my end.

I was discussing with Hitesh some possibilities along these lines for the Weighted Range Partitioner. Essentially, each task would maintain and aggregate a histogram of its output keys and the associated output data distribution. There would also be "user code" (possibly in the form of a virtual vertex) that consolidates the histograms from all tasks to produce an overall view of the data distribution and determines which ranges are sent to which reducer.
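
To make the histogram idea concrete, here is a minimal sketch in Python. It is not Tez code: the function names (task_histogram, consolidate) and the choice of "bytes per key" as the histogram weight are assumptions made purely for illustration.

```python
from collections import Counter

def task_histogram(records):
    """Hypothetical per-task step: total output bytes seen for each key."""
    hist = Counter()
    for key, value in records:
        hist[key] += len(value)
    return hist

def consolidate(histograms):
    """Hypothetical 'user code' step: merge per-task histograms into one view."""
    overall = Counter()
    for hist in histograms:
        overall.update(hist)  # Counter.update adds counts key by key
    return overall

# Two tasks report their local histograms; the consolidator merges them.
task_a = task_histogram([("apple", b"xx"), ("kiwi", b"xxxx")])
task_b = task_histogram([("apple", b"xxx"), ("pear", b"x")])
overall = consolidate([task_a, task_b])
```

The merged histogram is what the range-assignment logic would consume when deciding which key ranges go to which reducer.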

The partitioner supplied to the processor can initially bucket the keys into some configurable number of ranges, say 100. Then, after the user code has run on the scheduler side, it can either repartition the buckets (which could be easy, as it's essentially slicing and dicing the existing buckets) or, if the bucket sizes are small enough, make each reducer responsible for a certain range of buckets.
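
A rough sketch of the second option (each reducer takes a contiguous range of buckets) might look like the following. Again, this is illustrative Python rather than Tez code: bucket_of and assign_buckets are hypothetical names, integer keys in a known key space are assumed, and the greedy split by cumulative size is just one simple way to balance the load.

```python
def bucket_of(key, num_buckets, key_space):
    """Map an integer key in [0, key_space) to one of num_buckets equal-width ranges."""
    return key * num_buckets // key_space

def assign_buckets(bucket_sizes, num_reducers):
    """Give each reducer a contiguous run of buckets holding roughly an equal share.

    Returns a list whose i-th entry is the reducer index for bucket i.
    """
    total = sum(bucket_sizes)
    assignment = []
    reducer = 0
    running = 0
    for size in bucket_sizes:
        assignment.append(reducer)
        running += size
        # Move on once the cumulative size reaches this reducer's share boundary.
        if reducer < num_reducers - 1 and running * num_reducers >= total * (reducer + 1):
            reducer += 1
    return assignment

# Seven buckets with skewed sizes, split across three reducers.
sizes = [10, 10, 80, 10, 10, 40, 40]
plan = assign_buckets(sizes, 3)
```

Because only bucket boundaries move, the map-side partitioner does not need to change; the scheduler-side code only decides which reducer reads which run of buckets. A single oversized bucket still lands on one reducer in this sketch, which is the case where repartitioning the buckets themselves would be needed.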

Certain details of course have to be worked out, but I think it would be awesome if you could keep this proposal in mind as you start to develop the framework for the reduce task parallelism, because I think the needs of both features are very similar. 
                
> Determine reduce task parallelism
> ---------------------------------
>
>                 Key: TEZ-338
>                 URL: https://issues.apache.org/jira/browse/TEZ-338
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>              Labels: TEZ-0.2.0
>
> Determine the parallelism of reduce tasks at runtime. This is important because it's difficult to determine this accurately before the job actually runs due to unknown data reduction ratios in the intermediate stages.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira