flink-issues mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-5332) Non-thread safe FileSystem::initOutPathLocalFS() can cause lost files/directories in local execution
Date Tue, 13 Dec 2016 19:48:58 GMT
Stephan Ewen created FLINK-5332:
-----------------------------------

             Summary: Non-thread safe FileSystem::initOutPathLocalFS() can cause lost files/directories in local execution
                 Key: FLINK-5332
                 URL: https://issues.apache.org/jira/browse/FLINK-5332
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.0
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
            Priority: Critical
             Fix For: 1.2.0


This is mainly relevant to tests and Local Mini Cluster executions.

The {{FileOutputFormat}} and its subclasses rely on {{FileSystem::initOutPathLocalFS()}} to
prepare the output directory. When multiple parallel output writers call that method, there
is a slim chance that one parallel thread deletes another thread's directory. The checks in
the method are not bulletproof.

I believe that this is the cause of many of the Travis test instabilities that we have
observed over time.

Simply synchronizing that method per process should do the trick. Since it is a rarely called
initialization method, and only relevant in tests and local mini cluster executions, the cost
of the lock should be acceptable. I see no other way, as we do not have simple access to an
atomic "check and delete and recreate" file operation.
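
A minimal sketch of what per-process synchronization could look like, using plain {{java.nio.file}} rather than Flink's {{FileSystem}} class. The class name {{LocalOutputDirInit}}, the method {{initOutputDir}}, and the lock field are hypothetical names chosen for illustration; this is not the actual patch. The check, the delete, and the recreate all happen under one lock, so no thread can delete a directory that another thread has just created.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LocalOutputDirInit {

    // Hypothetical process-wide lock; all parallel writers in one JVM share it.
    private static final Object OUTPUT_DIR_INIT_LOCK = new Object();

    private LocalOutputDirInit() {}

    /**
     * Ensures that 'dir' exists as a directory.
     * If a regular file occupies the path, it is deleted first when
     * 'overwrite' is true; otherwise the method returns false.
     */
    static boolean initOutputDir(Path dir, boolean overwrite) throws IOException {
        synchronized (OUTPUT_DIR_INIT_LOCK) {
            if (Files.isDirectory(dir)) {
                // Another writer already prepared the directory; leave it alone.
                return true;
            }
            if (Files.exists(dir)) {
                if (!overwrite) {
                    return false;
                }
                // Remove the stale regular file that blocks the directory path.
                Files.delete(dir);
            }
            Files.createDirectories(dir);
            return true;
        }
    }
}
```

The lock only covers threads in the same JVM, which matches the tests and local mini cluster case described here; it gives no protection across processes.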

The synchronization also makes many "retry" code paths obsolete (no retries should be
needed on proper file systems).



