spark-issues mailing list archives

From "Ioannis Deligiannis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13718) Scheduler "creating" straggler node
Date Mon, 07 Mar 2016 11:48:40 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182901#comment-15182901
] 

Ioannis Deligiannis edited comment on SPARK-13718 at 3/7/16 11:48 AM:
----------------------------------------------------------------------

Point taken, though I'd rank it higher than minor since it severely affects non-batch applications
(which in client-application terms would be considered a bug; not a Spark bug). In any case,
do you think this would be better placed on the mailing list?


was (Author: jid1):
Point taken, though I'd rank it higher than minor since it severely affects non-batch applications
(which in application terms would be considered a bug). In any case, do you think this would
be better placed on the mailing list?

> Scheduler "creating" straggler node 
> ------------------------------------
>
>                 Key: SPARK-13718
>                 URL: https://issues.apache.org/jira/browse/SPARK-13718
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.3.1
>         Environment: Spark 1.3.1
> MapR-FS
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 48 cores & 512 GB RAM / node
> Data Replication factor of 3
> Each Node has 4 Spark executors configured with 12 cores each and 22GB of RAM.
>            Reporter: Ioannis Deligiannis
>            Priority: Minor
>
> *Data:*
> * Assume an even distribution of data across the cluster with a replication factor of 3.
> * In-memory data are partitioned into 128 chunks (384 cores in total, i.e. 3 requests can be executed concurrently(-ish)).
> *Action:*
> * Action is a simple sequence of map/filter/reduce. 
> * The action operates upon and returns a small subset of the data (following the full map over the data).
> * Data are cached once, serialized in memory (Kryo), so calling the action should not hit the disk under normal conditions.
> * Action network usage is low, as it returns a small number of aggregated results and does not require excessive shuffling.
> * Under low or moderate load, each action is expected to complete in less than 2 seconds
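A minimal sketch of such a workload in Spark's Scala API (all names, sizes, and the SparkContext setup are illustrative; the original code is not part of the report):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical reconstruction of the reported workload: a cached,
// Kryo-serialized RDD queried repeatedly with a map/filter/reduce action.
object StragglerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("straggler-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // 128 partitions spread over the cluster, cached serialized in memory.
    val data = sc.parallelize(1 to 10000000, numSlices = 128)
      .persist(StorageLevel.MEMORY_ONLY_SER)
    data.count() // materialize the cache once

    // The action: full map over the data, returning a small aggregate.
    val result = data
      .map(x => x.toLong * x)
      .filter(_ % 2 == 0)
      .reduce(_ + _)
    println(result)

    sc.stop()
  }
}
```

Each call of the final map/filter/reduce should be served entirely from the in-memory serialized blocks, which is why disk and network only enter the picture once locality falls back below NODE_LOCAL.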
> *H/W Outlook*
> When the action is called at high volume, the cluster CPU initially gets close to 100% (which is expected & intended).
> After a while, cluster utilization drops significantly, with only one (straggler) node at 100% CPU and a fully utilized network.
> *Diagnosis:*
> 1. Attached a profiler to the driver and executors to monitor for GC or I/O issues; everything is normal under both low and heavy usage.
> 2. Cluster network usage is very low
> 3. No issues on the Spark UI, except that tasks begin to move from LOCAL to ANY
> *Cause :*
> 1. Node 'H' is doing marginally more work than the rest (being a little slower and at almost 100% CPU)
> 2. The scheduler hits the default 3000 ms 'spark.locality.wait' and assigns the task to other nodes (in some cases it will assign at NODE locality, which means loading from HDD, and then follow the fallback sequence down to ANY)
> 3. One of the nodes 'X' that accepted the task will eventually try to fetch the data from the HDD of node 'H'. This adds HDD and network I/O to node 'H', plus some extra CPU for the I/O.
> 4. The time for 'X' to complete increases ~5x, as it now involves HDD/network I/O
> 5. Eventually, every node has a task waiting to fetch that specific partition from node 'H', so the cluster is effectively blocked on a single node
> *Proposed Fix*
> I have not worked with Scala enough to propose a code fix, but at a high level, when a task hits the 'spark.locality.wait' timeout, the scheduler should give the node accepting the task a 'hint' to use as its data source a replica that is not on the node that failed to accept the task in the first place.
> *Workaround*
> By tuning 'spark.locality.wait', there is a deterministic value (depending on partitions and config) at which the problem ceases to exist.
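As a sketch, such a workaround could look like this in spark-defaults.conf (the values are illustrative; the report does not state the deterministic value it found):

```
# Illustrative values only -- raise the locality wait so tasks stay
# PROCESS_LOCAL/NODE_LOCAL longer instead of falling back to ANY.
# (Spark 1.3.1 takes these in milliseconds; the default is 3000.)
spark.locality.wait          10000

# Per-level overrides also exist if only one fallback step is the problem:
spark.locality.wait.node     10000
spark.locality.wait.rack     10000
```

The trade-off is that a longer wait can leave cores idle on a genuinely failed node, so the value has to be tuned against partition count and task duration.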
> *PS1*: I don't have enough Scala skills to follow the code or propose a fix, but I hope this has enough information to make sense.
> *PS2*: Debugging this issue made me realize that there can be many use-cases that trigger this behaviour



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

