spark-dev mailing list archives

From Victor Tso-Guillen <v...@paxata.com>
Subject Re: Scheduler hang?
Date Fri, 27 Feb 2015 04:32:47 GMT
Of course, breakpointing on every status update and revive offers
invocation kept the problem from happening. Where could the race be?

On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen <vtso@paxata.com> wrote:

> Love to hear some input on this. I did get a standalone cluster up on my
> local machine and the problem didn't present itself. I'm pretty confident
> that means the problem is in the LocalBackend or something near it.
>
> On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen <vtso@paxata.com>
> wrote:
>
>> Okay I confirmed my suspicions of a hang. I made a request that stopped
>> progressing, though the already-scheduled tasks had finished. I made a
>> separate request that was small enough not to hang, and it kicked the hung
>> job enough to finish. I think what's happening is that the scheduler or the
>> local backend is not kicking the revive offers messaging at the right time,
>> but I have to dig into the code some more to nail the culprit. Anyone on
>> this list have experience in those code areas that could help?
>>
>> On Thu, Feb 26, 2015 at 2:27 AM, Victor Tso-Guillen <vtso@paxata.com>
>> wrote:
>>
>>> Thanks for the link. Unfortunately, I turned on rdd compression and
>>> nothing changed. I tried moving netty -> nio and no change :(
>>>
>>> On Thu, Feb 26, 2015 at 2:01 AM, Akhil Das <akhil@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Not many that I know of, but I bumped into this one:
>>>> https://issues.apache.org/jira/browse/SPARK-4516
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen <vtso@paxata.com>
>>>> wrote:
>>>>
>>>>> Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
>>>>> dependencies that produce no data?
>>>>>
>>>>> On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen <vtso@paxata.com>
>>>>> wrote:
>>>>>
>>>>>> The data is small. The job is composed of many small stages.
>>>>>>
>>>>>> * I found that with fewer than 222 the problem exhibits. What will be
>>>>>> gained by going higher?
>>>>>> * Pushing up the parallelism only pushes up the boundary at which the
>>>>>> system appears to hang. I'm worried about some sort of message loss or
>>>>>> inconsistency.
>>>>>> * Yes, we are using Kryo.
>>>>>> * I'll try that, but I'm again a little confused why you're
>>>>>> recommending this. I'm stumped so might as well?
>>>>>>
>>>>>> On Wed, Feb 25, 2015 at 11:13 PM, Akhil Das <
>>>>>> akhil@sigmoidanalytics.com> wrote:
>>>>>>
>>>>>>> What operation are you trying to do and how big is the data that you
>>>>>>> are operating on?
>>>>>>>
>>>>>>> Here are a few things you can try:
>>>>>>>
>>>>>>> - Repartition the RDD to a higher number than 222
>>>>>>> - Specify the master as local[*] or local[10]
>>>>>>> - Use Kryo Serializer (.set("spark.serializer",
>>>>>>> "org.apache.spark.serializer.KryoSerializer"))
>>>>>>> - Enable RDD Compression (.set("spark.rdd.compress","true") )
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Thu, Feb 26, 2015 at 10:15 AM, Victor Tso-Guillen <
>>>>>>> vtso@paxata.com> wrote:
>>>>>>>
>>>>>>>> I'm getting this really reliably on Spark 1.2.1. Basically I'm in
>>>>>>>> local mode with parallelism at 8. I have 222 tasks and I never seem to
>>>>>>>> get far past 40. Usually in the 20s to 30s it will just hang. The last
>>>>>>>> logging is below, and a screenshot of the UI.
>>>>>>>>
>>>>>>>> 2015-02-25 20:39:55.779 GMT-0800 INFO  [task-result-getter-3] TaskSetManager - Finished task 3.0 in stage 16.0 (TID 22) in 612 ms on localhost (1/5)
>>>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-10] Executor - Finished task 1.0 in stage 16.0 (TID 20). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-8] Executor - Finished task 2.0 in stage 16.0 (TID 21). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.831 GMT-0800 INFO  [task-result-getter-0] TaskSetManager - Finished task 1.0 in stage 16.0 (TID 20) in 670 ms on localhost (2/5)
>>>>>>>> 2015-02-25 20:39:55.836 GMT-0800 INFO  [task-result-getter-1] TaskSetManager - Finished task 2.0 in stage 16.0 (TID 21) in 674 ms on localhost (3/5)
>>>>>>>> 2015-02-25 20:39:55.891 GMT-0800 INFO  [Executor task launch worker-9] Executor - Finished task 0.0 in stage 16.0 (TID 19). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.896 GMT-0800 INFO  [task-result-getter-2] TaskSetManager - Finished task 0.0 in stage 16.0 (TID 19) in 740 ms on localhost (4/5)
>>>>>>>>
>>>>>>>> [image: Inline image 1]
>>>>>>>> What should I make of this? Where do I start?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Victor
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
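
For anyone retracing this thread, the settings suggested in the quoted
messages can be collected in one place. This is a minimal sketch, not code
posted by anyone in the thread. The object name, the app name, and the sample
RDD are hypothetical, and mapping the "netty -> nio" switch to
spark.shuffle.blockTransferService is an assumption about which setting was
meant. Per the replies above, compression, the nio switch, and Kryo did not
resolve the hang, and raising the partition count only moved the boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; only the configuration values come from the thread.
object SchedulerHangRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("scheduler-hang-repro") // hypothetical name
      .setMaster("local[10]")             // the thread also suggests "local[*]"
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")
      // Assumption: this is the setting behind "moving netty -> nio" on Spark 1.2.x.
      .set("spark.shuffle.blockTransferService", "nio")

    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in data, repartitioned to more than 222 partitions.
      val rdd = sc.parallelize(1 to 10000).repartition(444)
      println(rdd.count())
    } finally {
      sc.stop()
    }
  }
}
```

Changing one setting at a time makes it easier to see whether any of them
shifts the point at which the job stops progressing.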
