Date: Wed, 8 Jul 2015 11:43:05 +0000 (UTC)
From: "TezQA (JIRA)"
To: issues@tez.apache.org
Reply-To: dev@tez.apache.org
Subject: [jira] [Commented] (TEZ-2496) Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source

[ https://issues.apache.org/jira/browse/TEZ-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618460#comment-14618460 ]

TezQA commented on TEZ-2496:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12744190/TEZ-2496.8.patch
against master revision cb59851.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.

{color:green}+1 javac{color}.
The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/890//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/890//console

This message is automatically generated.

> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-2496
> URL: https://issues.apache.org/jira/browse/TEZ-2496
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2496.1.patch, TEZ-2496.2.patch, TEZ-2496.3.patch, TEZ-2496.4.patch, TEZ-2496.5.patch, TEZ-2496.6.patch, TEZ-2496.7.patch, TEZ-2496.8.patch, TEZ-2496.8.patch
>
>
> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source. This would help in scenarios with limited resources (or concurrently running jobs, or multiple waves) and data skew, where the task that receives a large amount of data gets scheduled much later.
> For example, consider the following Hive query running in a queue with limited capacity (42 slots in total) at 200 GB scale:
> {noformat}
> CREATE TEMPORARY TABLE sampleData AS
> SELECT CASE
>          WHEN ss_sold_time_sk IS NULL THEN 70429
>          ELSE ss_sold_time_sk
>        END AS ss_sold_time_sk,
>        ss_item_sk,
>        ss_customer_sk,
>        ss_cdemo_sk,
>        ss_hdemo_sk,
>        ss_addr_sk,
>        ss_store_sk,
>        ss_promo_sk,
>        ss_ticket_number,
>        ss_quantity,
>        ss_wholesale_cost,
>        ss_list_price,
>        ss_sales_price,
>        ss_ext_discount_amt,
>        ss_ext_sales_price,
>        ss_ext_wholesale_cost,
>        ss_ext_list_price,
>        ss_ext_tax,
>        ss_coupon_amt,
>        ss_net_paid,
>        ss_net_paid_inc_tax,
>        ss_net_profit,
>        ss_sold_date_sk
> FROM store_sales distribute by ss_sold_time_sk;
> {noformat}
> This generated 39 maps and 134 reduce slots (3 reduce waves). When ss_sold_time_sk contains many nulls, the data skews toward the key 70429. If the reducer that receives this data is scheduled early (i.e., in the first wave), the entire job finishes sooner.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
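
The scheduling idea in the issue description can be illustrated with a minimal sketch. It is not the code from the attached patches and does not use the Tez API: the class PartitionSizeOrdering, the holder PendingTask, and the method scheduleOrder are hypothetical names, and the per-partition byte counts are assumed to be already known from the source tasks. The sketch only shows the ordering step: reducers with the largest partitions come first, so a skewed partition lands in the first wave.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Tez or of the TEZ-2496 patches.
final class PartitionSizeOrdering {

    /** Hypothetical holder: a reducer task index and the bytes reported for its partition. */
    static final class PendingTask {
        final int taskIndex;
        final long partitionBytes;

        PendingTask(int taskIndex, long partitionBytes) {
            this.taskIndex = taskIndex;
            this.partitionBytes = partitionBytes;
        }
    }

    /**
     * Returns reducer task indexes ordered by descending partition size.
     * Ties are broken by the lower task index so the order is deterministic.
     */
    static List<Integer> scheduleOrder(long[] partitionBytes) {
        List<PendingTask> tasks = new ArrayList<>();
        for (int i = 0; i < partitionBytes.length; i++) {
            tasks.add(new PendingTask(i, partitionBytes[i]));
        }
        // Largest partition first; equal sizes keep ascending task index.
        tasks.sort(Comparator.comparingLong((PendingTask t) -> t.partitionBytes)
                .reversed()
                .thenComparingInt(t -> t.taskIndex));

        List<Integer> order = new ArrayList<>();
        for (PendingTask t : tasks) {
            order.add(t.taskIndex);
        }
        return order;
    }

    public static void main(String[] args) {
        // Assumed sizes: task 1 holds the skewed partition.
        long[] sizes = {10L, 500L, 20L, 20L};
        System.out.println(scheduleOrder(sizes)); // prints [1, 2, 3, 0]
    }
}
```

With a limited number of slots, the first entries of the returned list would be the tasks launched in the first wave. How the sizes are collected and when the ordering is recomputed is left out of this sketch.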