flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8599) Improve the failure behavior of the ContinuousFileReaderOperator for bad files
Date Thu, 08 Feb 2018 08:29:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356638#comment-16356638

Fabian Hueske commented on FLINK-8599:

[~kkl0u]: IIRC, you've been involved in the design of the monitoring file source. What do
you think?

> Improve the failure behavior of the ContinuousFileReaderOperator for bad files
> ------------------------------------------------------------------------------
>                 Key: FLINK-8599
>                 URL: https://issues.apache.org/jira/browse/FLINK-8599
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Chengzhi Zhao
>            Priority: Major
> So we have a s3 path that flink is monitoring that path to see new files available.
> {code:java}
> val avroInputStream_activity = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY,
10000)  {code}
> I am doing both internal and external check pointing and let's say there is a bad file
(for example, a different schema been dropped in this folder) came to the path and flink will
do several retries. I want to take those bad files and let the process continue. However,
since the file path persist in the checkpoint, when I try to resume from external checkpoint,
it threw the following error on no file been found.
> {code:java}
> java.io.IOException: Error opening the Input Split s3a://myfile [0,904]: No such file
or directory: s3a://myfile{code}
> As [~fhueske@gmail.com] suggested, we could check if a path exists and before trying
to read a file and ignore the input split instead of throwing an exception and causing a failure.
> Also, I am thinking about add an error output for bad files as an option to users. So
if there is any bad files exist we could move them in a separated path and do further analysis. 
> Not sure how people feel about it, but I'd like to contribute on it if people think this
can be an improvement. 

This message was sent by Atlassian JIRA

View raw message