flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Job fails with FileNotFoundException from blobStore
Date Thu, 05 Feb 2015 15:16:20 GMT
Hmm, killing the Job/TaskManagers is not a very gentlemanly way to terminate
them. I'll check how the ActorSystem behaves when the process is killed.

Why can't we implement a more graceful termination mechanism? For example,
we could send a termination message to the JobManager and TaskManagers.
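
For illustration, here is a minimal sketch of the JVM shutdown hook idea
discussed in this thread. It is not Flink code: the class name
BlobDirCleanup, the helper names, and the temporary directory prefix are all
hypothetical. It only shows how a hook registered through
Runtime.getRuntime().addShutdownHook() could delete a storage directory when
the JVM exits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical example class, not part of Flink.
public class BlobDirCleanup {

    // Recursively delete a directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts deeper paths first, so directories are empty
            // by the time they are deleted.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Build (but do not register) a thread that removes the storage directory.
    static Thread cleanupHook(Path storageDir) {
        return new Thread(() -> {
            try {
                deleteRecursively(storageDir);
            } catch (IOException | UncheckedIOException e) {
                System.err.println("Could not clean up " + storageDir + ": " + e);
            }
        }, "blob-dir-cleanup");
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for a blob storage directory.
        Path storageDir = Files.createTempDirectory("blobStore-example-");
        Files.createFile(storageDir.resolve("blob_1"));

        // The hook runs on normal exit and on SIGTERM/SIGINT.
        Runtime.getRuntime().addShutdownHook(cleanupHook(storageDir));
        System.out.println("Storage dir: " + storageDir);
    }
}
```

Note that shutdown hooks do not run when the process is killed with SIGKILL
(kill -9), so a hook alone only helps if the stop scripts send a signal the
JVM can handle.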

On Thu, Feb 5, 2015 at 4:10 PM, Ufuk Celebi <uce@apache.org> wrote:

> Thank you very much, Robert!
>
> The problem is that the job/task manager shutdown methods are never
> called. When using the scripts, the task/job manager processes get killed
> and therefore shutdown methods are never called.
>
> @Till: Do you know whether there is a mechanism in Akka to register the
> actors for JVM shutdown hooks? I tried to register a shutdown hook via
> Runtime.getRuntime().addShutdownHook(), but I didn't manage to get a
> reference to the task manager.
>
>
> On Thu, Feb 5, 2015 at 3:29 PM, Till Rohrmann <trohrmann@apache.org>
> wrote:
>
>> Hi Robert,
>>
>> thanks for the info. If the TaskManager/JobManager does not shut down
>> properly, e.g. because the process is killed, then the BlobManager indeed
>> cannot remove all stored files. I don't know whether this was the case
>> for you lately. Furthermore, the files are not deleted immediately after
>> the job has finished. Internally there is a cleanup task which is
>> triggered every hour and deletes all blobs which are no longer
>> referenced.
>>
>> But we definitely have to look into it to see how we could improve this
>> behaviour.
>>
>> Greets,
>>
>> Till
>>
>> On Thu, Feb 5, 2015 at 3:21 PM, Robert Waury <robert.waury@googlemail.com>
>> wrote:
>>
>>> I talked with the admins. The problem seemed to have been that the disk
>>> was full and Flink couldn't create the directory.
>>>
>>> Maybe the error message should reflect if that is the cause.
>>>
>>> While cleaning up the disk we noticed that a lot of temporary blobStore
>>> files were not deleted by Flink after the job finished. This seemed to have
>>> caused or at least worsened the problem.
>>>
>>> Cheers,
>>> Robert
>>>
>>> On Thu, Feb 5, 2015 at 1:14 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>
>>>> On Thu, Feb 5, 2015 at 11:23 AM, Robert Waury <
>>>> robert.waury@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I can reproduce the error on my cluster.
>>>>>
>>>>> Unfortunately I can't check whether the parent directories were
>>>>> created on the different nodes since I have no way of accessing them.
>>>>> I start all the jobs from a gateway.
>>>>>
>>>>
>>>> I've added a check to the directory creation (in branches release-0.8
>>>> and master), which should fail with a proper error message if that is the
>>>> problem. If you have time to (re)deploy Flink, it would be great to know
>>>> if that indeed is the issue. Otherwise, we need to further investigate
>>>> this.
>>>>
>>>>
>>>>
>>>
>>
>
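
As an illustration of the directory-creation check and the more informative
error message suggested in the quoted messages, here is a minimal sketch. It
is not the actual change made in release-0.8 or master: the class
StorageDirCheck and its method names are hypothetical. It fails with a
message that names the directory and, where it can be determined, the usable
disk space, so that a full disk is easier to spot.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class, not part of Flink.
public class StorageDirCheck {

    // Usable bytes on the file store of the nearest existing ancestor of dir,
    // or -1 if it cannot be determined.
    static long usableBytes(Path dir) {
        Path p = dir.toAbsolutePath();
        while (p != null && !Files.exists(p)) {
            p = p.getParent();
        }
        if (p == null) {
            return -1;
        }
        try {
            return Files.getFileStore(p).getUsableSpace();
        } catch (IOException e) {
            return -1;
        }
    }

    // Create the directory (including parents) or fail with a descriptive error.
    static Path ensureStorageDir(Path dir) throws IOException {
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new IOException("Could not create storage directory '" + dir
                    + "' (usable space: " + usableBytes(dir) + " bytes)", e);
        }
        if (!Files.isWritable(dir)) {
            throw new IOException("Storage directory '" + dir + "' is not writable");
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("storage-check-").resolve("a").resolve("b");
        System.out.println("Created " + ensureStorageDir(dir));
    }
}
```

The original exception is kept as the cause, so the underlying reason
reported by the file system is not lost.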
