spark-user mailing list archives

From Praveen R <prav...@sigmoidanalytics.com>
Subject Re: Lost an executor error - Jobs fail
Date Tue, 15 Apr 2014 05:03:38 GMT
Unfortunately, queries kept failing with SparkTask return code -101 errors, and I
only got them working after removing the troublesome node.

FAILED: Execution Error, return code -101 from shark.execution.SparkTask

I wish it were easier to reproduce. I shall try forcibly removing write
permissions on one node to see if the same error happens.
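
A rough way to check for the same condition directly (rather than waiting for
queries to fail) would be something like the sketch below, run from spark-shell.
It assumes a running SparkContext sc and the /mnt/spark, /mnt2/spark layout from
spark-ec2; the partition count and names are only illustrative.

import java.io.File
import java.net.InetAddress

// Directories configured via spark.local.dir on this cluster (spark-ec2 defaults).
val localDirs = Seq("/mnt/spark", "/mnt2/spark")

// Spread small tasks across the cluster and try to create a temp file in each
// local dir, reporting host, directory and whether the write succeeded.
val report = sc.parallelize(1 to 1000, 100).mapPartitions { _ =>
  val host = InetAddress.getLocalHost.getHostName
  localDirs.map { dir =>
    val writable =
      try {
        val probe = File.createTempFile("probe", ".tmp", new File(dir))
        probe.delete()
        true
      } catch {
        case _: Exception => false
      }
    s"$host $dir writable=$writable"
  }.iterator
}.distinct().collect()

report.sorted.foreach(println)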



On Tue, Apr 15, 2014 at 9:17 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

> Cool! It's pretty rare to actually get logs from a wild hardware failure.
> The problem is, as you said, that the executor keeps failing, but the worker
> doesn't get the hint, so it keeps creating new, bad executors.
>
> However, this issue should not have caused your cluster to fail to start
> up. In the linked logs, for instance, the shark shell started up just fine
> (though the "shark>" prompt was lost in some of the log messages). Queries should
> have been able to execute just fine. Was this not the case?
>
>
> On Mon, Apr 14, 2014 at 7:38 AM, Praveen R <praveen@sigmoidanalytics.com> wrote:
>
>> Configuration comes from the spark-ec2 setup script, which sets spark.local.dir
>> to use /mnt/spark, /mnt2/spark.
>> The setup actually worked for quite some time, and then one of the nodes
>> started showing disk errors such as:
>>
>> mv: cannot remove
>> `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only
>> file system
>> mv: cannot remove
>> `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_1_260_0': Read-only
>> file system
>> mv: cannot remove
>> `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_2_658_0': Read-only
>> file system
>>
>> I understand the issue is at the hardware level, but thought it would be great
>> if Spark could handle it and avoid the whole cluster going down.
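
For reference, spark.local.dir is just a comma-separated list of scratch
directories, so setting it by hand looks roughly like the sketch below. The paths
are the spark-ec2 ones above; the app name and the per-application style are only
illustrative (on spark-ec2 the value normally comes from spark-env.sh rather than
application code).

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: point Spark's shuffle/scratch space at the two instance-store
// mounts. Multiple comma-separated directories let Spark spread shuffle files
// across disks.
val conf = new SparkConf()
  .setAppName("local-dir-example")                   // illustrative name
  .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
val sc = new SparkContext(conf)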
>>
>>
>> On Mon, Apr 14, 2014 at 7:58 PM, giive chen <thegiive@gmail.com> wrote:
>>
>>> Hi Praveen
>>>
>>> What is your config for "spark.local.dir"?
>>> Do all your workers have this dir, and do they all have the right permissions
>>> on it?
>>>
>>> I think this is the reason for your error.
>>>
>>> Wisely Chen
>>>
>>>
>>> On Mon, Apr 14, 2014 at 9:29 PM, Praveen R <praveen@sigmoidanalytics.com> wrote:
>>>
>>>> Had the below error while running Shark queries on a 30-node cluster and was
>>>> not able to start the Shark server or run any jobs.
>>>>
>>>> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor
>>>> 4 (already removed): Failed to create local directory (bad
>>>> spark.local.dir?)
>>>> Full log: https://gist.github.com/praveenr019/10647049
>>>>
>>>> After spending quite some time, I found it was due to disk read errors on
>>>> one node, and got the cluster working after removing that node.
>>>>
>>>> Wanted to know if there is any configuration (like akkaTimeout) which can
>>>> handle this, or does Mesos help?
>>>>
>>>> Shouldn't the worker be marked dead in such a scenario, instead of making
>>>> the cluster unusable, so that debugging can be done at leisure?
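
For what it's worth, the two knobs I know of in this area are the application-side
Akka timeout and the standalone master's worker-heartbeat timeout; the sketch below
just names them with their usual defaults, though whether either actually covers
the half-broken-disk case is exactly the open question.

import org.apache.spark.SparkConf

// Application-side: how long Akka waits on a remote peer before timing out (seconds).
val conf = new SparkConf().set("spark.akka.timeout", "100")

// Master-side (standalone mode): seconds without a heartbeat before the master
// marks a worker as dead. Normally passed to the master process, e.g. via
// SPARK_MASTER_OPTS="-Dspark.worker.timeout=60", not set in the application conf.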
>>>>
>>>> Thanks,
>>>> Praveen R
>>>>
>>>>
>>>>
>>>
>>
>
