flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From rmetzger <...@git.apache.org>
Subject [GitHub] flink pull request #2198: [FLINK-4133] Reflect streaming file source changes...
Date Tue, 05 Jul 2016 13:13:52 GMT
Github user rmetzger commented on a diff in the pull request:

    --- Diff: docs/apis/streaming/index.md ---
    @@ -1310,21 +1310,28 @@ Data Sources
     <br />
    -Sources can by created by using `StreamExecutionEnvironment.addSource(sourceFunction)`.
    -You can either use one of the source functions that come with Flink or write a custom
    -by implementing the `SourceFunction` for non-parallel sources, or by implementing the
    -`ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel
    +Sources are where your program reads its input from. You can attach a source to your
program by 
    +using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number
of pre-implemented 
    +source functions, but you can always write your own custom sources by implementing the
    +for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or
extending the 
    +`RichParallelSourceFunction` for parallel sources.
     There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
    -- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as
    +- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat`
specification, line-by-line and returns them as Strings.
    -- `readFile(path)` / Any input format - Reads files as dictated by the input format.
    -- `readFileStream` - create a stream by appending elements when there are changes to
a file
    +- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified
file input format.
    +- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` -  This
is the method called internally by the two previous ones. It reads files in the `path` based
on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically
monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`),
or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`).
Using the `pathFilter`, the user can further exclude files from being processed.
    +    *IMPORTANT NOTES:* 
    +    1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a
file is modified, its contents are re-processed entirely. This can brake the "exactly-once"
semantics, as appending data at the end of a file will lead to **all** its contents being
    +    2. If the `watchType` is set to `FileProcessingMode#PROCESS_ONCE`, the source scans
the path **once** and exits, without waiting for the readers to finish reading. This leads
to no more checkpoints after that point, thus providing reduced fault-tolerance guarantees.
    --- End diff --
    Why are we not keeping the file-monitoring task alive until the job is cancelled? 

If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.

View raw message