flink-user mailing list archives

From Martin Neumann <mneum...@sics.se>
Subject Re: streaming hdfs sub folders
Date Wed, 17 Feb 2016 18:33:14 GMT
The program is a DataStream program; it usually gets its data from
Kafka. It's an anomaly detection program that learns from the stream
itself. The reason I want to read from files is to test different settings
of the algorithm and compare them.

I think I don't need to replay things in the exact order (which is not
possible with parallel reads anyway), and I have written the program so it
can deal with out-of-order events.
I only need the subfolders to be processed roughly in order. It's fine to
process some stuff from 01 before everything from 00 is finished, but if I
get records from all 24 subfolders at the same time, things will break.
If I set the flag, will it try to get data from all subdirectories in
parallel, or will it go subdirectory by subdirectory?

Also, can you point me to some documentation or an example that shows
how to set the flag?

cheers Martin
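
For illustration, here is a minimal sketch of the "subdirectory by
subdirectory" behaviour asked about above. It is plain Java on a local
file system, not Flink's FileInputFormat and not HDFS; the class name
HourlyFolderOrder and the method filesInFolderOrder are made up for this
example. It only shows the enumeration order (everything in 00 listed
before anything in 01), which a custom source could then follow.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HourlyFolderOrder {

    // Returns the regular files found one level below root, grouped by
    // subfolder, with subfolders visited in sorted order ("00" before "01").
    // Assumption: hour folders are zero-padded so that name order equals time order.
    public static List<Path> filesInFolderOrder(Path root) throws IOException {
        List<Path> subfolders;
        try (Stream<Path> dirs = Files.list(root)) {
            subfolders = dirs.filter(Files::isDirectory)
                             .sorted()
                             .collect(Collectors.toList());
        }
        List<Path> ordered = new ArrayList<>();
        for (Path folder : subfolders) {
            try (Stream<Path> files = Files.list(folder)) {
                files.filter(Files::isRegularFile)
                     .sorted()
                     .forEach(ordered::add);
            }
        }
        return ordered;
    }

    public static void main(String[] args) throws IOException {
        // Throwaway layout for the demo: <tmp>/00/a.txt and <tmp>/01/b.txt
        Path root = Files.createTempDirectory("hours");
        Files.createDirectories(root.resolve("01"));
        Files.createDirectories(root.resolve("00"));
        Files.write(root.resolve("01").resolve("b.txt"), "later".getBytes());
        Files.write(root.resolve("00").resolve("a.txt"), "earlier".getBytes());

        for (Path p : filesInFolderOrder(root)) {
            System.out.println(root.relativize(p));
        }
    }
}
```

The ordered list could be handed out in small batches, which would keep
reads roughly in order without forcing a strict barrier between hours.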




On Wed, Feb 17, 2016 at 11:49 AM, Stephan Ewen <sewen@apache.org> wrote:

> Hi!
>
> Going through nested folders is pretty simple, there is a flag on the
> FileInputFormat that makes sure those are read.
>
> The tricky part is that all "00" files should be read before the "01"
> files. If you still want parallel reads, that means you need to sync at
> some point: wait for all parallel parts to finish with the "00" work before
> anyone may start with the "01" work.
>
> Is your training program a DataStream or a DataSet program?
>
> Stephan
>
> On Wed, Feb 17, 2016 at 1:16 AM, Martin Neumann <mneumann@sics.se> wrote:
>
>> Hi,
>>
>> I have a streaming machine learning job that usually runs with input from
>> kafka. To tweak the models I need to run on some old data from HDFS.
>>
>> Unfortunately the data on HDFS is spread out over several subfolders.
>> Basically I have a date folder with one subfolder for each hour; within
>> those are the actual input files I'm interested in.
>>
>> Basically what I need is a source that goes through the subfolders in
>> order and streams the files into the program. I'm using event timestamps, so
>> all files in 00 need to be processed before 01.
>>
>> Does anyone have an idea on how to do this?
>>
>> cheers Martin
>>
>>
>
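
The sync point described in the quoted reply (finish all "00" work in
parallel before anyone starts on "01") can be sketched in plain Java as
follows. This is an illustration only, not how Flink implements it: the
class FolderBarrier and the method processInOrder are hypothetical, and
"reading a file" is replaced by appending its name to a log. The barrier
comes from ExecutorService.invokeAll, which returns only after every
submitted task has completed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FolderBarrier {

    // Processes folders in sorted key order. Files inside one folder run in
    // parallel; the next folder starts only after the current one is done.
    public static List<String> processInOrder(Map<String, List<String>> filesByFolder,
                                              int parallelism) throws InterruptedException {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // TreeMap iterates keys in sorted order: "00", "01", ...
            for (Map.Entry<String, List<String>> folder
                    : new TreeMap<>(filesByFolder).entrySet()) {
                List<Callable<Void>> tasks = new ArrayList<>();
                for (String file : folder.getValue()) {
                    tasks.add(() -> {
                        // Stand-in for actually reading the file.
                        log.add(folder.getKey() + "/" + file);
                        return null;
                    });
                }
                // Sync point: blocks until all tasks of this folder have finished.
                pool.invokeAll(tasks);
            }
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> input = new TreeMap<>();
        input.put("01", Arrays.asList("c", "d"));
        input.put("00", Arrays.asList("a", "b"));
        // Order within a folder may vary; all "00/..." entries precede "01/...".
        System.out.println(processInOrder(input, 2));
    }
}
```

This enforces a strict barrier between folders, which is stronger than
the "roughly in order" requirement in the reply at the top of this message.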
