flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: FileInputFormat that processes files in chronological order
Date Mon, 29 Apr 2019 09:26:11 GMT
Hi Sergei,

It depends on whether you want to process the files with the DataSet (batch) or
DataStream (stream) API.
Averell's answer addressed the DataStream API part.

The DataSet API does not have any built-in support to distinguish files (or
file splits) by folders and process them in order.
For the DataSet API, you would need to implement a custom InputFormat
(based on FileInputFormat) with a custom InputSplitAssigner implementation.
The InputSplitAssigner would need to assign splits to hosts based on their
path and in the correct order.
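
To illustrate the ordering part only, here is a minimal sketch in plain Java.
It is not Flink code: OrderedSplitQueue and Split are hypothetical stand-ins,
and the sketch assumes that sorting paths lexicographically yields the
chronological order (for example, date-named folders). In a real job, the
same logic would sit behind Flink's
InputSplitAssigner#getNextInputSplit(String host, int taskId), and the host
argument could additionally be used to pick splits by folder.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the ordering logic of a custom split assigner.
final class OrderedSplitQueue {

    // Simplified stand-in for a file split: a file path plus a start offset.
    static final class Split {
        final String path;
        final long start;

        Split(String path, long start) {
            this.path = path;
            this.start = start;
        }
    }

    private final Deque<Split> pending;

    OrderedSplitQueue(Collection<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        // Assumption: lexicographic path order equals chronological order.
        // Splits of the same file are ordered by their start offset.
        sorted.sort(Comparator.comparing((Split s) -> s.path)
                .thenComparingLong(s -> s.start));
        this.pending = new ArrayDeque<>(sorted);
    }

    // Returns the next split in order, or null when none are left.
    // Synchronized because several readers may request splits concurrently.
    synchronized Split next() {
        return pending.pollFirst();
    }
}
```

Note that handing out splits in order only fixes the order in which they are
assigned. With a parallelism greater than 1, several splits are read at the
same time, so they do not necessarily finish in that order.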

Best,
Fabian

On Sun, 28 Apr 2019 at 08:48, Averell <lvhuyen@gmail.com> wrote:

> Hi,
>
> Regarding splitting by shards, I believe that you can simply create two
> sources, one for each shard. After that, union them together.
>
> Regarding processing files in chronological order, Flink currently reads
> files using the files' last-modified-time order (i.e. oldest files will be
> processed first). So if your file1.json is older than file2, file2 is older
> than file3, then you don't need to do anything.
> If your file times are not in that order, then I think it's more complex. But
> I am curious about why there are such requirements in the first place. Is this a
> streaming problem?
>
> I don't think FileInputFormat has anything to do here. Use that when your
> files are in a format not currently supported by Flink.
>
> Regards,
> Averell
