flink-issues mailing list archives

From "buptljy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10168) support filtering files by modified/created time in StreamExecutionEnvironment.readFile()
Date Sun, 19 Aug 2018 11:23:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585121#comment-16585121 ]

buptljy commented on FLINK-10168:

[~phoenixjiangnan] You're right. The second solution makes more sense to me.

I think we can provide a new FileFilter that lets developers define filtering rules
based on:
 # File name.
 # Modified time.
 # Creation time.
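
A minimal sketch of what such a filter might look like, to make the idea concrete. Everything here is hypothetical and not existing Flink API: the names {{FileInfo}}, {{FileFilter}}, {{accept}}, {{and}} and {{select}} are assumptions made up for illustration, and the sketch assumes the file system can report a name, a modified time and a creation time for each file.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch only; none of these types exist in Flink.
final class FileFilterSketch {

    /** Assumed view of the attributes a file system could expose. */
    static final class FileInfo {
        final String name;
        final Instant modifiedTime;
        final Instant createTime;

        FileInfo(String name, Instant modifiedTime, Instant createTime) {
            this.name = name;
            this.modifiedTime = modifiedTime;
            this.createTime = createTime;
        }
    }

    /** Hypothetical filter: returns true if the file should be read. */
    @FunctionalInterface
    interface FileFilter {
        boolean accept(FileInfo file);

        /** Combines two filters; a file must pass both to be read. */
        default FileFilter and(FileFilter other) {
            return f -> accept(f) && other.accept(f);
        }
    }

    /** Returns the names of the files accepted by the filter. */
    static List<String> select(List<FileInfo> files, FileFilter filter) {
        return files.stream()
                .filter(filter::accept)
                .map(f -> f.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2018-08-01T00:00:00Z");
        List<FileInfo> files = Arrays.asList(
                new FileInfo("a.csv", Instant.parse("2018-08-10T00:00:00Z"), created),
                new FileInfo("b.csv", Instant.parse("2018-08-18T00:00:00Z"), created),
                new FileInfo("c.log", Instant.parse("2018-08-19T00:00:00Z"), created));

        Instant cutoff = Instant.parse("2018-08-15T00:00:00Z");

        // Rule on file name combined with a rule on modified time.
        FileFilter byName = f -> f.name.endsWith(".csv");
        FileFilter filter = byName.and(f -> f.modifiedTime.isAfter(cutoff));

        System.out.println(select(files, filter)); // prints [b.csv]
    }
}
```

With a single filter type that sees all attributes, rules on name and time can be combined in one place, rather than being passed in as separate per-attribute filters.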

> support filtering files by modified/created time in StreamExecutionEnvironment.readFile()
> -----------------------------------------------------------------------------------------
>                 Key: FLINK-10168
>                 URL: https://issues.apache.org/jira/browse/FLINK-10168
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API
>    Affects Versions: 1.6.0
>            Reporter: Bowen Li
>            Assignee: buptljy
>            Priority: Major
>             Fix For: 1.7.0
> Support filtering files by modified/created time in {{StreamExecutionEnvironment.readFile()}}.
> For example, in a source dir with lots of files, we only want to read files that were created
> or modified after a specific time.
> This API can expose a generic file filter function and let users define the filtering
> rules. Currently Flink only supports filtering files by path: the
> API is {{FileInputFormat.setFilesFilters(PathFilter)}}, which takes only one file path filter. A
> more generic API that can take more filters could look like this:
> 1) {{FileInputFormat.setFilesFilters(List(PathFilter, ModifiedTimeFilter, ... ))}}
> 2) or {{FileInputFormat.setFilesFilters(FileFilter)}}, where {{FileFilter}} exposes all
> file attributes that Flink's file system can provide, like path and modified time
> I lean towards the 2nd option, because it gives users more flexibility to define complex
> filtering rules based on combinations of file attributes.

This message was sent by Atlassian JIRA
