spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16827) Stop reporting spill metrics as shuffle metrics
Date Thu, 13 Oct 2016 03:43:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16827:
--------------------------------
    Fix Version/s: 2.0.2

> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Brian Cho
>              Labels: performance
>             Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs looks like this -
> {code}
>  SELECT  userid
>      FROM  table1 a
>      JOIN table2 b
>       ON    a.ds = '2016-07-15'
>       AND  b.ds = '2016-07-15'
>       AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into it, we found
> that one of the stages produces an excessive amount of shuffle data. Please note that this
> is a regression from Spark 1.6: Stage 2 of the job, which used to produce 32 KB of shuffle
> data with 1.6, now produces more than 400 GB with Spark 2.0. We also tried turning off
> whole-stage code generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still produces correct output.
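
Per the issue title, the inflated figure turned out to be an accounting problem rather than extra data being shuffled: spill bytes were being reported against the shuffle metrics. The following is a minimal, self-contained sketch of that confusion; the names are hypothetical and stand in for Spark's real task-metrics plumbing, which is not shown here.

{code}
// Hypothetical names -- a sketch of the accounting bug, not Spark's code.
object SpillVsShuffleSketch extends App {
  final class TaskMetricsSketch {
    var shuffleBytesWritten: Long = 0L // what the UI reports as shuffle data
    var diskBytesSpilled: Long = 0L    // where spill bytes belong

    // Buggy accounting: sort spills are charged to the shuffle counter.
    def recordSpillBuggy(bytes: Long): Unit = shuffleBytesWritten += bytes

    // Fixed accounting: spills go to a dedicated spill counter.
    def recordSpillFixed(bytes: Long): Unit = diskBytesSpilled += bytes
  }

  val metrics = new TaskMetricsSketch
  metrics.shuffleBytesWritten += 32L * 1024                   // real shuffle output: 32 KB
  val hundredGiB = 100L * 1024 * 1024 * 1024
  (1 to 4).foreach(_ => metrics.recordSpillBuggy(hundredGiB)) // spills during the sort

  // With the buggy accounting the UI reports > 400 GB of "shuffle data"
  // for a stage whose actual shuffle output is 32 KB.
  println(s"reported shuffle bytes: ${metrics.shuffleBytesWritten}")
}
{code}

This reading is consistent with the reporter's observations: the output stays correct (only the metric is wrong), and disabling whole-stage code generation (the spark.sql.codegen.wholeStage flag) does not help, since the query plan is not the problem.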



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

