flink-issues mailing list archives

From "Huyen Levan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9940) File source continuous monitoring mode: S3 files sometimes missed
Date Fri, 27 Jul 2018 11:59:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559639#comment-16559639 ]

Huyen Levan commented on FLINK-9940:
------------------------------------

I'm currently working on a fix for this bug. However, I could not assign it to myself.

> File source continuous monitoring mode: S3 files sometimes missed
> -----------------------------------------------------------------
>
>                 Key: FLINK-9940
>                 URL: https://issues.apache.org/jira/browse/FLINK-9940
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.5.1
>         Environment: Flink 1.5, EMRFS
>            Reporter: Huyen Levan
>            Priority: Major
>              Labels: EMRFS, Flink, S3
>
> When using StreamExecutionEnvironment.readFile() with FileProcessingMode.PROCESS_CONTINUOUSLY
> to monitor an S3 prefix, the directory monitoring process might miss some files if a large
> number of files are created or modified at the same time. The number of missed files
> depends on the monitoring interval.
> Cause: Flink tracks which files it has read by remembering the modification time of
> the file that was added (or modified) last. So when multiple files have the same
> last-modified timestamp, some of them may be missed.
> Suggested solution (thanks to [Fabian Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]): a
> hybrid approach that keeps the names of all files whose modification timestamp is larger
> than the maximum modification time minus an offset.
> Relevant class: _org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_
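
The hybrid approach suggested in the description could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the actual
patch and not Flink code: the class name RecentFileTracker, the method
shouldProcess, and the offsetMillis parameter are hypothetical. The idea is to
remember the paths of files whose modification time falls within the offset
window below the maximum seen so far, so that a file sharing a timestamp with
an already-processed file is still picked up when it shows up in a later
listing.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating the hybrid tracking idea; not part of Flink. */
final class RecentFileTracker {

    /** Width of the window (in ms) below the max mod time in which file names are remembered. */
    private final long offsetMillis;

    /** Largest modification time seen so far. */
    private long maxModTime = Long.MIN_VALUE;

    /** Files at or below this modification time are treated as already processed. */
    private long threshold = Long.MIN_VALUE;

    /** Path -> modification time for files newer than the threshold. */
    private final Map<String, Long> recentFiles = new HashMap<>();

    RecentFileTracker(long offsetMillis) {
        if (offsetMillis < 0) {
            throw new IllegalArgumentException("offsetMillis must be non-negative");
        }
        this.offsetMillis = offsetMillis;
    }

    /** Returns true if the file is new or modified and should be forwarded for reading. */
    boolean shouldProcess(String path, long modTime) {
        // Older than the window: assume it was handled in an earlier scan.
        if (modTime <= threshold) {
            return false;
        }
        // Inside the window: skip only if this exact version was already seen.
        Long seen = recentFiles.get(path);
        if (seen != null && seen >= modTime) {
            return false;
        }
        recentFiles.put(path, modTime);
        if (modTime > maxModTime) {
            maxModTime = modTime;
            threshold = maxModTime - offsetMillis;
            // Forget names that have fallen out of the window to bound memory use.
            final long cutoff = threshold;
            recentFiles.entrySet().removeIf(e -> e.getValue() <= cutoff);
        }
        return true;
    }
}
```

With an offset of 1000 ms, two files with the same timestamp that appear in
different scans are both accepted, while a repeated listing of an unchanged
file is rejected. The trade-off is that a file arriving with a timestamp older
than the window is still ignored, so the offset would have to be chosen larger
than the expected listing delay.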



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
