spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: simultaneous actions
Date Sun, 17 Jan 2016 20:23:45 GMT
It can be far more than that (e.g.
https://issues.apache.org/jira/browse/SPARK-11838), and is generally either
unrecognized or a greatly under-appreciated and underused feature of Spark.

On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers <koert@tresata.com> wrote:

> the re-use of shuffle files is always a nice surprise to me
>
> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra <mark@clearstorydata.com>
> wrote:
>
>> Same SparkContext means same pool of Workers.  It's up to the Scheduler,
>> not the SparkContext, whether the exact same Workers or Executors will be
>> used to calculate simultaneous actions against the same RDD.  It is likely
>> that many of the same Workers and Executors will be used as the Scheduler
>> tries to preserve data locality, but that is not guaranteed.  In fact, what
>> is most likely to happen is that the shared Stages and Tasks being
>> calculated for the simultaneous actions will not actually be run at exactly
>> the same time, which means that shuffle files produced for one action will
>> be reused by the other(s), and repeated calculations will be avoided even
>> without explicitly caching/persisting the RDD.
>>
>> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> Same RDD means same SparkContext, which means same workers.
>>>
>>> Cache/persist the RDD to avoid repeated jobs.
>>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennour.r@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you all for your answers,
>>>>
>>>> If I understand correctly, actions (in my case foreach) can be run
>>>> concurrently and simultaneously on the SAME RDD (which is logical because
>>>> RDDs are read-only objects). However, I want to know whether the same
>>>> workers are used for the concurrent analysis?
>>>>
>>>> Thank you
>>>>
>>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <jodersky@gmail.com>:
>>>>
>>>>> I stand corrected. How considerable are the benefits though? Will the
>>>>> scheduler be able to dispatch jobs from both actions simultaneously (or on
>>>>> a when-workers-become-available basis)?
>>>>>
>>>>> On 15 January 2016 at 11:44, Koert Kuipers <koert@tresata.com> wrote:
>>>>>
>>>>>> we run multiple actions on the same (cached) RDD all the time, I
>>>>>> guess in different threads indeed (it's in Akka)
>>>>>>
>>>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>
>>>>>>> RDDs actually are thread-safe, and quite a few applications use them
>>>>>>> this way, e.g. the JDBC server.
>>>>>>>
>>>>>>> Matei
>>>>>>>
>>>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <jodersky@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I don't think RDDs are threadsafe.
>>>>>>> More fundamentally however, why would you want to run RDD actions in
>>>>>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>>>>>> computing parallel operations on distributed data. Even if you were to
>>>>>>> call actions from several threads at once, the individual executors of
>>>>>>> your Spark environment would still have to perform operations sequentially.
>>>>>>>
>>>>>>> As an alternative, I would suggest restructuring your RDD
>>>>>>> transformations to compute the required results in a single operation.
>>>>>>>
>>>>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcoveney@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Threads
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, 15 January 2016, Kira <mennour.r@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Can we run *simultaneous* actions on the *same RDD*? If yes, how
>>>>>>>>> can this be done?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Regards
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
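
For readers who want to try the pattern discussed in this thread, here is a
minimal sketch of submitting two actions against the same RDD from separate
threads. It is an illustration under assumptions, not code from any of the
posters: the object name, the app name, the local[4] master, and the sample
data are all made up for the example, and scala.concurrent.Future is just one
convenient way to get the "Threads" mentioned above. The RDD is deliberately
not cached; whether the second job actually reuses shuffle files written for
the first depends on how the scheduler orders the work, as noted in the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical example object; names and data are illustrative only.
object SimultaneousActions {
  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions made for this sketch.
    val conf = new SparkConf()
      .setAppName("simultaneous-actions")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)

    try {
      // An RDD with a shuffle dependency (reduceByKey), not cached/persisted.
      val counts = sc.parallelize(1 to 100000)
        .map(i => (i % 10, 1L))
        .reduceByKey(_ + _)

      // Each Future body runs on its own thread and submits an independent
      // job against the same RDD through the same SparkContext.
      val totalF = Future { counts.map(_._2).reduce(_ + _) }
      val keysF = Future { counts.count() }

      // Block until both jobs have finished, then report their results.
      val total = Await.result(totalF, Duration.Inf)
      val keys = Await.result(keysF, Duration.Inf)
      println(s"total = $total, distinct keys = $keys")
    } finally {
      sc.stop()
    }
  }
}
```

If the RDD is expensive to compute and is used by many actions, calling
counts.cache() before starting the threads is the explicit alternative
suggested earlier in the thread.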
