Return-Path: X-Original-To: apmail-tez-issues-archive@minotaur.apache.org Delivered-To: apmail-tez-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 20CB9C068 for ; Mon, 5 Aug 2013 20:01:26 +0000 (UTC) Received: (qmail 52045 invoked by uid 500); 5 Aug 2013 20:01:25 -0000 Delivered-To: apmail-tez-issues-archive@tez.apache.org Received: (qmail 52009 invoked by uid 500); 5 Aug 2013 20:01:23 -0000 Mailing-List: contact issues-help@tez.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tez.incubator.apache.org Delivered-To: mailing list issues@tez.incubator.apache.org Received: (qmail 52002 invoked by uid 99); 5 Aug 2013 20:01:21 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Aug 2013 20:01:21 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Mon, 05 Aug 2013 20:01:19 +0000 Received: (qmail 51964 invoked by uid 99); 5 Aug 2013 20:00:57 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Aug 2013 20:00:57 +0000 Date: Mon, 5 Aug 2013 20:00:56 +0000 (UTC) From: "Achal Soni (JIRA)" To: issues@tez.incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (TEZ-338) Determine reduce task parallelism MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/TEZ-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729877#comment-13729877 ] Achal Soni commented on TEZ-338: -------------------------------- I think this is pretty interesting, and looks like a good approach from my end. I was discussing with Hitesh some possibilities along this line for the Weighted Range Partitioner. Essentially what would happen is each task can maintain and aggregate the histogram of the key and associated output data distribution. There would also be "user code" (could be in the form of a virtual vertex) that then consolidates the histogram from each task to produce an overall view of the data distribution, and determines which ranges are sent to which reducer. The partitioner supplied to the processor can initially bucket the keys into some configurable amount of ranges - say 100. Then after the user code has run on the scheduler side, it can either repartition the buckets (which could be easy as it's essentially slicing and dicing the different buckets, or if the buckets sizes are small enough, each reducer can be responsible for a certain range of buckets). Certain details of course have to be worked out, but I think it would be awesome if you could keep this proposal in mind as you start to develop the framework for the reduce task parallelism, because I think the needs of both features are very similar. > Determine reduce task parallelism > --------------------------------- > > Key: TEZ-338 > URL: https://issues.apache.org/jira/browse/TEZ-338 > Project: Apache Tez > Issue Type: Sub-task > Reporter: Bikas Saha > Labels: TEZ-0.2.0 > > Determine the parallelism of reduce tasks at runtime. This is important because its difficult to determine this accurately before the job actually runs due to unknown data reduction ratios in the intermediate stages. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira