spark-issues mailing list archives

From "Imran Rashid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
Date Wed, 16 Aug 2017 20:53:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129380#comment-16129380
] 

Imran Rashid commented on SPARK-20589:
--------------------------------------

It's pretty much the same thing whether you're trying to limit at the beginning of the pipeline
or at the end (or anywhere in between); that was just an example.  My suggested workaround
is *not* to change the number of partitions -- I know Spark is very sensitive to the number
of partitions for all sorts of reasons.  I'm suggesting you have multiple applications, each
with a different number of *executors*.  So you can still have a large number of tasks, but
with a small number of executors you'll constrain concurrency.
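
For example, here's a minimal sketch of what the small, rate-limited application could look like, assuming static allocation (e.g. on YARN); the app name and the exact values are just placeholders:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// A second, separate application dedicated to the rate-limited work.
// Capping the number of executors (and cores per executor) caps how many
// tasks can run at once, without changing the number of partitions.
val conf = new SparkConf()
  .setAppName("rate-limited-writer")                // placeholder name
  .set("spark.executor.instances", "2")             // few executors...
  .set("spark.executor.cores", "1")                 // ...one core each => at most 2 concurrent tasks
  .set("spark.dynamicAllocation.enabled", "false")  // keep the executor count fixed

val sc = new SparkContext(conf)
{code}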

Also, to be clear, the current proposed fix requires *exactly* the thing you are saying you
don't want to do: "breaking the pipeline into different stages and running each with different
configs".  You need to turn something like

{code}
bigRDD.map(...).filter(...).reduceByKey(...).flatMap(...).join(...).map(...).saveToSomeRateLimitedDestination()
{code}

into

{code}
import org.apache.spark.storage.StorageLevel

// first job group: run the heavy transformations with the normal config
sc.setJobGroup(...)
val dataReadyToSave = bigRDD.map(...).filter(...).reduceByKey(...).flatMap(...).join(...).map(...)
dataReadyToSave.persist(StorageLevel.DISK_ONLY)
dataReadyToSave.count()   // materialize the intermediate data

// second job group: only the rate-limited write runs here
sc.setJobGroup(...)
dataReadyToSave.saveToSomeRateLimitedDestination()
sc.setJobGroup(...)       // restore the previous / default job group
{code}

You still need to break up the operations on your RDD, and persist the intermediate data somewhere.

In any case, I do understand that this is simpler than having two entirely independent Spark
applications.  But I want to make sure this would actually help as much as you are expecting.
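
For completeness, if you did go with two independent applications, a rough sketch (the HDFS path and element type here are hypothetical) would be to hand the intermediate data off through the filesystem instead of persist():

{code}
// application 1 (normal executor count): compute and write out the intermediate data
dataReadyToSave.saveAsObjectFile("hdfs:///tmp/intermediate-data")   // hypothetical path

// application 2 (small executor count): read it back and do only the rate-limited write
val intermediate = sc.objectFile[(String, String)]("hdfs:///tmp/intermediate-data")  // hypothetical element type
intermediate.saveToSomeRateLimitedDestination()
{code}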

> Allow limiting task concurrency per stage
> -----------------------------------------
>
>                 Key: SPARK-20589
>                 URL: https://issues.apache.org/jira/browse/SPARK-20589
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks per stage.
> This is useful when your Spark job might be accessing another service and you don't want
> to DoS that service, for instance Spark writing to HBase or Spark doing HTTP PUTs on a service.
> Many times you want to do this without limiting the number of partitions.


