flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Job fails with FileNotFoundException from blobStore
Date Thu, 05 Feb 2015 15:57:31 GMT
Sounds good. In the course of this, we should probably extend the IOManager
so that it keeps track of temp files and deletes them when a task is done.
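
A minimal sketch of what such tracking could look like, assuming temp files
are registered under a task identifier when they are created. TaskTempFiles,
createTempFile and releaseTask are hypothetical names for illustration only,
not part of the actual IOManager API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not Flink's IOManager): remembers which temp files
// each task created so they can be deleted when that task is done.
final class TaskTempFiles {

    private final Path baseDir;
    private final Map<String, Set<Path>> filesByTask = new ConcurrentHashMap<>();

    TaskTempFiles(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Creates a temp file under baseDir and records it for the given task.
    Path createTempFile(String taskId) {
        try {
            Path file = Files.createTempFile(baseDir, taskId + "-", ".tmp");
            filesByTask
                .computeIfAbsent(taskId, k -> ConcurrentHashMap.newKeySet())
                .add(file);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(
                "Could not create temp file in " + baseDir, e);
        }
    }

    // Deletes every file recorded for the task and returns how many files
    // were actually removed. Unknown task ids yield 0.
    int releaseTask(String taskId) {
        Set<Path> files = filesByTask.remove(taskId);
        if (files == null) {
            return 0;
        }
        int deleted = 0;
        for (Path file : files) {
            try {
                if (Files.deleteIfExists(file)) {
                    deleted++;
                }
            } catch (IOException e) {
                // Best effort: continue with the remaining files.
            }
        }
        return deleted;
    }
}
```

The task's completion path (normal finish, failure or cancellation) would
call releaseTask once, so files of other tasks in the same directory stay
untouched.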

On Thu, Feb 5, 2015 at 4:40 PM, Ufuk Celebi <uce@apache.org> wrote:

> After talking to Robert and Till offline, what about the following:
>
> - We add a shutdown hook to the blob library cache manager to shut down the
> blob service (just a delete call)
> - As Robert pointed out, we cannot do this with the IOManager paths right
> now, because they are essentially shared among multiple Flink instances.
> Therefore we add an IOManager directory per Flink instance as well, which
> we can simply delete on shutdown.
>
> Is that OK?
>
> On Thu, Feb 5, 2015 at 4:23 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> I think that process killing (TERM signal) is a very typical way in Linux
>> to shut down processes. It is the most robust way, since it does not
>> require sending any custom messages to the process.
>>
>> This is sort of graceful, as the JVM gets the signal and may do a lot of
>> things before shutting down, such as running shutdown hooks. The ungraceful
>> variant is the KILL signal, which just removes the process.
>>
>>
>>
>> On Thu, Feb 5, 2015 at 4:16 PM, Till Rohrmann <trohrmann@apache.org>
>> wrote:
>>
>>> Hmm, this is not a very gentleman-like way to terminate the Job/TaskManagers.
>>> I'll check how the ActorSystem behaves in case of killing the process.
>>>
>>> Why can't we implement a more graceful termination mechanism? For
>>> example, we could send a termination message to the JobManager and
>>> TaskManagers.
>>>
>>> On Thu, Feb 5, 2015 at 4:10 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>
>>>> Thank you very much, Robert!
>>>>
>>>> The problem is that the job/task manager shutdown methods are never
>>>> called. When using the scripts, the task/job manager processes get killed
>>>> and therefore shutdown methods are never called.
>>>>
>>>> @Till: Do you know whether there is a mechanism in Akka to register the
>>>> actors for JVM shutdown hooks? I tried to register a shutdown hook via
>>>> Runtime.getRuntime().addShutdownHook(), but I didn't manage to get a
>>>> reference to the task manager.
>>>>
>>>>
>>>> On Thu, Feb 5, 2015 at 3:29 PM, Till Rohrmann <trohrmann@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Robert,
>>>>>
>>>>> thanks for the info. If the TaskManager/JobManager does not shut down
>>>>> properly, e.g. because the process is killed, then it is indeed the case
>>>>> that the BlobManager cannot properly remove all stored files. I don't know
>>>>> whether this was recently the case for you. Furthermore, the files are not
>>>>> deleted directly after the job has finished. Internally there is a cleanup
>>>>> task which is triggered every hour and deletes all blobs which are no
>>>>> longer referenced.
>>>>>
>>>>> But we definitely have to look into it to see how we could improve
>>>>> this behaviour.
>>>>>
>>>>> Greets,
>>>>>
>>>>> Till
>>>>>
>>>>> On Thu, Feb 5, 2015 at 3:21 PM, Robert Waury <
>>>>> robert.waury@googlemail.com> wrote:
>>>>>
>>>>>> I talked with the admins. The problem seemed to have been that the
>>>>>> disk was full and Flink couldn't create the directory.
>>>>>>
>>>>>> Maybe the error message should reflect whether that is the cause.
>>>>>>
>>>>>> While cleaning up the disk we noticed that a lot of temporary
>>>>>> blobStore files were not deleted by Flink after the job finished. This
>>>>>> seemed to have caused or at least worsened the problem.
>>>>>>
>>>>>> Cheers,
>>>>>> Robert
>>>>>>
>>>>>> On Thu, Feb 5, 2015 at 1:14 PM, Ufuk Celebi <uce@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Feb 5, 2015 at 11:23 AM, Robert Waury <
>>>>>>> robert.waury@googlemail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I can reproduce the error on my cluster.
>>>>>>>>
>>>>>>>> Unfortunately I can't check whether the parent directories were
>>>>>>>> created on the different nodes since I have no way of accessing them. I
>>>>>>>> start all the jobs from a gateway.
>>>>>>>>
>>>>>>>
>>>>>>> I've added a check to the directory creation (in branches
>>>>>>> release-0.8 and master), which should fail with a proper error message if
>>>>>>> that is the problem. If you have time to (re)deploy Flink, it would be
>>>>>>> great to know if that indeed is the issue. Otherwise, we need to further
>>>>>>> investigate this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
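
For illustration, the proposal quoted above (one temp directory per Flink
instance, deleted by a shutdown hook) could be sketched roughly as follows.
This is not Flink code: InstanceTempDir and its methods are hypothetical
names, and the random UUID suffix is only an assumption about how instances
sharing one configured path could be kept apart.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;

// Hypothetical sketch (not Flink code): one temp directory per process
// instance, removed by a JVM shutdown hook.
final class InstanceTempDir {

    private final Path dir;

    // Creates <parent>/instance-<random UUID>, so several instances that
    // share the same parent path never touch each other's files.
    InstanceTempDir(Path parent) {
        Path created;
        try {
            created = Files.createDirectories(
                parent.resolve("instance-" + UUID.randomUUID()));
        } catch (IOException e) {
            throw new UncheckedIOException(
                "Could not create instance directory under " + parent, e);
        }
        this.dir = created;
    }

    Path path() {
        return dir;
    }

    // Deletes the directory tree, children before their parents.
    void delete() {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    // Best effort: continue with the remaining entries.
                }
            });
        } catch (IOException e) {
            // Best effort: nothing more can be done during shutdown.
        }
    }

    // Registers the cleanup with the JVM and returns the hook thread so
    // that callers can remove it again if they clean up explicitly.
    Thread registerShutdownHook() {
        Thread hook = new Thread(this::delete, "instance-temp-dir-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Shutdown hooks run on a normal JVM exit and when the process receives a
TERM signal (plain kill), but not on a KILL signal (kill -9), so leftovers
from hard kills would still need a separate cleanup.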
