tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TEZ-2496) Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source
Date Mon, 06 Jul 2015 10:05:05 GMT

     [ https://issues.apache.org/jira/browse/TEZ-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Rajesh Balamohan updated TEZ-2496:
----------------------------------
    Attachment: TEZ-2496.6.patch

Minor codefix in ShuffleVertexManager.

> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the
source
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2496
>                 URL: https://issues.apache.org/jira/browse/TEZ-2496
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2496.1.patch, TEZ-2496.2.patch, TEZ-2496.3.patch, TEZ-2496.4.patch,
TEZ-2496.5.patch, TEZ-2496.6.patch
>
>
> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the
source.  This would be helpful in scenarios, where there is limited resources (or concurrent
jobs running or multiple waves) with dataskew and the task which gets large amount of data
gets sceheduled much later.
> e.g Consider the following hive query running in a queue with limited capacity (42 slots
in total) @ 200 GB scale
> {noformat}
> CREATE TEMPORARY TABLE sampleData AS
>   SELECT CASE
>            WHEN ss_sold_time_sk IS NULL THEN 70429
>            ELSE ss_sold_time_sk
>        END AS ss_sold_time_sk,
>        ss_item_sk,
>        ss_customer_sk,
>        ss_cdemo_sk,
>        ss_hdemo_sk,
>        ss_addr_sk,
>        ss_store_sk,
>        ss_promo_sk,
>        ss_ticket_number,
>        ss_quantity,
>        ss_wholesale_cost,
>        ss_list_price,
>        ss_sales_price,
>        ss_ext_discount_amt,
>        ss_ext_sales_price,
>        ss_ext_wholesale_cost,
>        ss_ext_list_price,
>        ss_ext_tax,
>        ss_coupon_amt,
>        ss_net_paid,
>        ss_net_paid_inc_tax,
>        ss_net_profit,
>        ss_sold_date_sk
>   FROM store_sales distribute by ss_sold_time_sk;
> {noformat}
> This generated 39 maps and 134 reduce slots (3 reduce waves). When lots of nulls are
there for ss_sold_time_sk, it would tend to have data skew towards 70429.  If the reducer
which gets this data gets scheduled much earlier (i.e in first wave itself), entire job would
finish fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message