spark-issues mailing list archives

From "Thunder Stumpges (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19371) Cannot spread cached partitions evenly across executors
Date Tue, 14 Nov 2017 19:02:01 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Thunder Stumpges updated SPARK-19371:
-------------------------------------
    Attachment: Unbalanced RDD Blocks, and resulting task imbalance.png

> Cannot spread cached partitions evenly across executors
> -------------------------------------------------------
>
>                 Key: SPARK-19371
>                 URL: https://issues.apache.org/jira/browse/SPARK-19371
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Thunder Stumpges
>         Attachments: Unbalanced RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and resulting task imbalance.png
>
>
> Before running an intensive iterative job (in this case distributed topic-model training), we need to load a dataset and persist it across executors.
> After loading from HDFS and persisting, the partitions are spread unevenly across executors (based on the initial scheduling of the reads, which is not data-locality sensitive). The partition sizes are even; it is only their distribution over executors that is skewed. We currently have no way to force the partitions to spread evenly, and as the iterative algorithm begins, tasks are assigned to executors based on this initial placement, forcing some very unbalanced work.
> This has been mentioned a [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059] of [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html] in [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html] user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here are examples of things I have tried. All resulted in partitions in memory that were NOT evenly distributed across executors, causing future tasks to be imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
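> For completeness, both settings above are submit-time configs; a hedged sketch of how they might be passed (the class and jar names below are placeholders, not from the report, and neither setting affected the skew):

```shell
# Sketch only: submit-time form of the two settings tried above.
# com.example.TrainTopicModel and app.jar are hypothetical placeholders.
spark-submit \
  --conf spark.shuffle.reduceLocality.enabled=false \
  --conf spark.memory.useLegacyMode=true \
  --class com.example.TrainTopicModel \
  app.jar
```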
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
>     parquet("/data/folder_to_load").
>     repartition(numPartitions).
>     persist
> df.count
> {code}
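> To see the skew concretely: on a live cluster one can tag each partition with the executor it landed on (for example via SparkEnv.get.executorId inside a mapPartitions pass) and tally cached blocks per executor. The tally itself is plain Scala; the placements below are made-up numbers mimicking the attached screenshot, not measured values:

```scala
// Tally cached blocks per executor from (partitionId, executorId) placements.
// On a real cluster the executor tags would come from a Spark pass such as
// rdd.mapPartitions { it => Iterator(SparkEnv.get.executorId) }; here we use
// hypothetical placements to illustrate the imbalance being reported.
def blocksPerExecutor(placements: Seq[(Int, String)]): Map[String, Int] =
  placements.groupBy(_._2).map { case (exec, ps) => exec -> ps.size }

// 12 partitions, but one executor served most of the initial reads:
val skewed = (0 until 12).map(i => (i, if (i < 9) "exec-1" else "exec-2"))
val counts = blocksPerExecutor(skewed)
// counts: Map(exec-1 -> 9, exec-2 -> 3), i.e. the kind of skew described above
```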
> *Load and repartition to 2x partitions, then shuffle repartition down to desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
>     parquet("/data/folder_to_load").
>     repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
>     persist
> df.count
> {code}
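> For reference, the placement being requested is easy to state: round-robin assignment of P partitions over E executors leaves every executor with either floor(P/E) or ceil(P/E) cached blocks. A toy sketch in plain Scala (no Spark, and not an existing Spark API; 48*16 partitions over 48 executors, matching the snippets above):

```scala
// Desired behavior in miniature: assign partition i to executor i % E, so
// blocks spread as evenly as arithmetic allows. This sketches the placement
// being requested; it is not something Spark exposes today.
def roundRobin(numPartitions: Int, executors: Seq[String]): Map[String, Int] =
  (0 until numPartitions)
    .groupBy(i => executors(i % executors.size))
    .map { case (exec, parts) => exec -> parts.size }

val counts = roundRobin(48 * 16, (1 to 48).map(i => s"exec-$i"))
// 768 partitions over 48 executors: exactly 16 blocks on every executor
assert(counts.values.forall(_ == 16))
```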
> It would be great if, when persisting an RDD/DataFrame, we could request that those partitions be stored evenly across executors in preparation for future tasks.
> I'm not sure if this is a more general issue (i.e. not just one involving persisting RDDs), but for the persisted in-memory case it can make a HUGE difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


