spark-issues mailing list archives

From "nirav patel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-15845) Expose metrics for sub-stage transformations and action
Date Thu, 09 Jun 2016 16:43:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

nirav patel updated SPARK-15845:
--------------------------------
    Description: 
Spark optimizes DAG processing by efficiently selecting stage boundaries. This makes a Spark
stage a sequence of multiple transformations and zero or one action. As a result, the stage
Spark is currently running can internally be a series such as (map -> shuffle -> map ->
map -> collect). Notice that it goes past the shuffle dependency and folds the subsequent
transformations and the action into the same stage, so any task of this stage essentially
executes all of those transformations/actions as one unit, and there is no further visibility
inside it. Network read, populating partitions, compute, shuffle write, shuffle read, compute,
and writing final partitions to disk all happen within one stage; every task of that stage
performs all of those operations on a single partition as a unit. This takes away a huge amount
of visibility into the user's transformations and actions: which one is taking longer, which
one is the resource bottleneck, and which one is failing. (See the sketch below.)
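
For illustration, a minimal sketch using the Scala RDD API (the input path and names are made
up). The narrow transformations are pipelined into a single stage, so each task runs them as
one opaque unit with no per-transformation timing or failure attribution:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative job; the path and variable names are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("stage-visibility-example"))

val raw      = sc.textFile("hdfs:///data/events")       // input read
val parsed   = raw.map(_.split(","))                    // transformation 1
val filtered = parsed.filter(_.length > 3)              // transformation 2
val keyed    = filtered.map(fields => (fields(0), 1L))  // transformation 3
val counts   = keyed.reduceByKey(_ + _)                 // shuffle dependency
val result   = counts.collect()                         // action

// The map -> filter -> map chain is executed as one stage; the UI and metrics
// report that stage as a whole, not the individual transformations inside it.
{code}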

The Spark UI just shows that some action stage is currently running. If the job fails at that
point, the UI only says the action failed, but the failure could be anywhere in that lazy chain
of evaluation. Looking at executor logs gives some insight, but that is not always straightforward.
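
As a partial mitigation today, one can at least label work so the UI description is more
meaningful. A sketch reusing {{sc}} and {{raw}} from the example above; the labels are
illustrative, and this only renames what the UI shows, it does not add sub-stage metrics:

{code:scala}
// SparkContext.setCallSite/clearCallSite and RDD.setName are existing public
// APIs; they change the description shown in the UI but add no new metrics.
sc.setCallSite("parse + project events")
val labeled = raw.map(_.split(",")).map(fields => (fields(0), 1L)).setName("keyed events")
sc.clearCallSite()
{code}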

I think we need more visibility into what is happening underneath a task (the series of Spark
transformations/actions that comprise a stage) so we can troubleshoot more easily, find
bottlenecks, and optimize our DAG. A rough workaround is sketched below.
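
Until something like this exists in Spark itself, the closest approximation I can think of is
to wrap each transformation's function and sum its time into a named accumulator. A hedged
sketch: the helper {{timedMap}} is hypothetical (not an existing Spark API) and it assumes the
Spark 2.x {{longAccumulator}} API:

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: wraps a map function so the time spent in it is summed
// into a named accumulator, visible per stage in the UI. Per-record timing adds
// overhead, so this is a rough diagnostic, not a real metrics facility.
def timedMap[T, U: ClassTag](rdd: RDD[T], sc: SparkContext, label: String)(f: T => U): RDD[U] = {
  val nanos = sc.longAccumulator(s"time-in-$label")
  rdd.map { record =>
    val start = System.nanoTime()
    val out   = f(record)
    nanos.add(System.nanoTime() - start)
    out
  }
}

// Usage (names illustrative): the two accumulators give a rough split of where
// time goes inside the single pipelined stage.
// val parsed    = timedMap(raw, sc, "parse")(_.split(","))
// val projected = timedMap(parsed, sc, "project")(fields => (fields(0), 1L))
{code}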

PS - Received positive feedback about this from a Databricks dev team member at Spark Summit.



> Expose metrics for sub-stage transformations and action 
> --------------------------------------------------------
>
>                 Key: SPARK-15845
>                 URL: https://issues.apache.org/jira/browse/SPARK-15845
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: nirav patel
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

