flink-issues mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-3515) Make the "file monitoring source" exactly-once
Date Thu, 25 Feb 2016 18:19:18 GMT
Stephan Ewen created FLINK-3515:

             Summary: Make the "file monitoring source" exactly-once
                 Key: FLINK-3515
                 URL: https://issues.apache.org/jira/browse/FLINK-3515
             Project: Flink
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 0.10.2
            Reporter: Stephan Ewen

The stream source that watches directories for changes is currently not "exactly-once".

To make it exactly-once, the source (which emits the files to be read) and the flatMap (which
reads the files) each need to keep track of their position at the point of a checkpoint.

Assuming that files do not change after creation (HDFS / S3 style), we can do this as follows:

  - The source can track the files it has already emitted downstream via their creation/modification
timestamps, assuming that new files always get newer timestamps.

  - The flatMappers need to always store the path of their current file split, plus the
byte offset they have reached within that split.
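
The two pieces of state described in the list above can be sketched in plain Java. This is only an illustration under the stated assumptions (immutable files, strictly newer timestamps for new files), not a proposed implementation: it uses no Flink classes, and every name in it (MonitorState, ReaderState, ReaderPosition, selectNewFiles, snapshot, restore) is hypothetical. How the snapshots would be wired into Flink's checkpointing is deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain Java, no Flink classes. All names are hypothetical. */
public class ExactlyOnceFileMonitoringSketch {

    /** Source side: remembers the newest timestamp among files already emitted. */
    static final class MonitorState {
        private long maxEmittedTimestamp = Long.MIN_VALUE;

        /**
         * Returns the paths whose timestamp is newer than anything emitted so far,
         * oldest first, and advances the remembered timestamp.
         * Relies on the assumption that new files always get newer timestamps.
         */
        List<String> selectNewFiles(Map<String, Long> timestampsByPath) {
            List<Map.Entry<String, Long>> fresh = new ArrayList<>();
            for (Map.Entry<String, Long> entry : timestampsByPath.entrySet()) {
                if (entry.getValue() > maxEmittedTimestamp) {
                    fresh.add(entry);
                }
            }
            fresh.sort(Map.Entry.comparingByValue());
            List<String> paths = new ArrayList<>();
            for (Map.Entry<String, Long> entry : fresh) {
                paths.add(entry.getKey());
                maxEmittedTimestamp = entry.getValue(); // ascending order, so the last one is the max
            }
            return paths;
        }

        /** State to store in a checkpoint. */
        long snapshot() {
            return maxEmittedTimestamp;
        }

        /** Rolls back to the state stored in a checkpoint. */
        void restore(long checkpointed) {
            maxEmittedTimestamp = checkpointed;
        }
    }

    /** Immutable checkpoint of a reader: which split, and how far into it. */
    static final class ReaderPosition {
        final String path;
        final long byteOffset;

        ReaderPosition(String path, long byteOffset) {
            this.path = path;
            this.byteOffset = byteOffset;
        }
    }

    /** Reader side: tracks the current split and the bytes consumed from it. */
    static final class ReaderState {
        private String currentPath;
        private long byteOffset;

        void open(String path) {
            currentPath = path;
            byteOffset = 0L;
        }

        void advance(long bytesRead) {
            byteOffset += bytesRead;
        }

        ReaderPosition snapshot() {
            return new ReaderPosition(currentPath, byteOffset);
        }

        void restore(ReaderPosition position) {
            currentPath = position.path;
            byteOffset = position.byteOffset;
        }
    }

    public static void main(String[] args) {
        MonitorState monitor = new MonitorState();
        Map<String, Long> listing = new HashMap<>();
        listing.put("/in/a", 10L);
        listing.put("/in/b", 20L);
        System.out.println("first scan:  " + monitor.selectNewFiles(listing));
        System.out.println("second scan: " + monitor.selectNewFiles(listing));

        ReaderState reader = new ReaderState();
        reader.open("/in/a");
        reader.advance(128L);
        ReaderPosition checkpoint = reader.snapshot();
        reader.advance(64L);        // progress made after the checkpoint
        reader.restore(checkpoint); // on failure, resume from the checkpointed offset
        System.out.println("resume " + checkpoint.path + " at byte " + checkpoint.byteOffset);
    }
}
```

Keeping only the maximum timestamp makes the source state a single long, but it is correct only while the "newer files get newer timestamps" assumption holds; a file that appears later with an equal or older timestamp would be skipped.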
