tez-issues mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TEZ-1649) ShuffleVertexManager auto reduce parallelism can cause jobs to hang indefinitely (with ScatterGather edges)
Date Thu, 09 Oct 2014 09:14:33 GMT

     [ https://issues.apache.org/jira/browse/TEZ-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1649:
----------------------------------
    Attachment: TEZ-1649.png

> ShuffleVertexManager auto reduce parallelism can cause jobs to hang indefinitely (with ScatterGather edges)
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1649
>                 URL: https://issues.apache.org/jira/browse/TEZ-1649
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-1649.png
>
>
> Consider the following DAG
>  M1, M2 --> R1
>  M2, M3 --> R2
>  R1 --> R2
> All edges are Scatter-Gather.
>  1. Set R1's (parallelism 1000) min/max settings to 0.25f and 0.5f
>  2. Set R2's (parallelism 21) min/max settings to 0.2f and 0.3f
>  3. Let M1 send some data from HDFS (test.txt)
>  4. Let M2 (50 parallelism) generate some data and send it to R2
>  5. Let M3 (500 parallelism) generate some data and send it to R2
> - Since R2's min/max can be satisfied by events from M3 alone, R2 changes its parallelism sooner than R1.
> - In the meantime, R1 changes its parallelism from 1000 to 20. This change is not propagated to R2, which keeps waiting.
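
A rough check of the first point, using only the numbers in the report. The sketch assumes the min/max fraction is measured over the combined task count of all of R2's scatter-gather sources; that is an assumption about how the vertex manager counts, not something the report states. The names below (PARALLELISM, EDGES, completed_fraction) are illustrative and are not Tez APIs.

```python
# Parallelism as given in the report. M1's parallelism is not stated,
# so M1 is omitted here and R1's own fraction is not computed.
PARALLELISM = {"M2": 50, "M3": 500, "R1": 1000, "R2": 21}

# Scatter-gather edges of the DAG: (source, destination).
EDGES = [("M1", "R1"), ("M2", "R1"), ("M2", "R2"), ("M3", "R2"), ("R1", "R2")]

# (min, max) settings from steps 1 and 2.
FRACTIONS = {"R1": (0.25, 0.5), "R2": (0.2, 0.3)}


def sources_of(vertex):
    """Return the source vertices feeding the given vertex."""
    return [src for src, dst in EDGES if dst == vertex]


def completed_fraction(vertex, finished):
    """Fraction of all source tasks of `vertex` that have finished.

    ASSUMPTION: the fraction is taken over the sum of all source tasks.
    `finished` maps a source vertex name to its number of finished tasks.
    """
    total = sum(PARALLELISM[s] for s in sources_of(vertex))
    done = sum(finished.get(s, 0) for s in sources_of(vertex))
    return done / total


# Only M3 has finished; M2 and R1 have reported nothing yet.
r2_fraction = completed_fraction("R2", {"M3": 500})
r2_min, r2_max = FRACTIONS["R2"]
print(round(r2_fraction, 3), r2_fraction >= r2_max)  # 0.323 True
```

Under that assumption, M3's 500 tasks are about 32% of R2's 1550 source tasks, which already exceeds R2's max of 0.3, so R2 can reconfigure before R1 has produced anything.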
> Tested this on a small (20-node) cluster, and the hang occurs consistently.
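
The hang can be illustrated with a toy model. This is a minimal sketch, not Tez code: ToyConsumer and its methods are hypothetical stand-ins for R2, and the model only assumes that R2 keeps waiting for as many R1 task completions as R1 had tasks when R2 last looked.

```python
class ToyConsumer:
    """Hypothetical stand-in for R2; not Tez's ShuffleVertexManager."""

    def __init__(self, source_parallelism):
        # Snapshot of each source's task count, taken when the consumer is set up.
        self.expected = dict(source_parallelism)
        self.completed = {name: 0 for name in self.expected}

    def on_source_task_completed(self, source):
        self.completed[source] += 1

    def on_source_parallelism_changed(self, source, new_parallelism):
        # The propagation step that, per the report, does not happen.
        self.expected[source] = new_parallelism

    def pending(self):
        """Number of source task completions the consumer is still waiting for."""
        return sum(
            max(self.expected[s] - self.completed[s], 0) for s in self.expected
        )


def run(propagate_update):
    """Replay the scenario and return how many completions R2 still awaits."""
    r2 = ToyConsumer({"M2": 50, "M3": 500, "R1": 1000})
    for _ in range(50):
        r2.on_source_task_completed("M2")
    for _ in range(500):
        r2.on_source_task_completed("M3")
    # R1 shrinks from 1000 to 20 tasks after R2 took its snapshot.
    if propagate_update:
        r2.on_source_parallelism_changed("R1", 20)
    for _ in range(20):  # every R1 task that actually exists finishes
        r2.on_source_task_completed("R1")
    return r2.pending()


print(run(propagate_update=False))  # 980: R2 waits for tasks that will never run
print(run(propagate_update=True))   # 0: R2 has everything it expects
```

Without the update, the toy R2 waits on 980 completions from R1 tasks that no longer exist, which corresponds to the indefinite wait described above.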



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
