flink-user mailing list archives

From Martin Neumann <mneum...@sics.se>
Subject streaming hdfs sub folders
Date Wed, 17 Feb 2016 00:16:46 GMT

I have a streaming machine learning job that usually runs with input from
Kafka. To tweak the models, I need to run it on some old data from HDFS.

Unfortunately, the data on HDFS is spread out over several subfolders.
I have one folder per date with one subfolder for each hour, and within
those are the actual input files I'm interested in.

What I need is a source that goes through the subfolders in order and
streams the files into the program. I'm using event timestamps, so all
files in 00 need to be processed before those in 01.
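
To make the ordering requirement concrete, here is a rough sketch of the
listing step only. It is not a Flink source: it uses java.nio on a local
directory as a stand-in for HDFS, and it assumes the hour subfolders have
zero-padded names (00, 01, ...) so that sorting by name gives chronological
order. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OrderedHourFiles {

    // Lists the direct children of dir, sorted by file name.
    static List<Path> sortedChildren(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children
                    .sorted((a, b) -> a.getFileName().toString()
                            .compareTo(b.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }

    // Returns every regular file under dayDir, one hour subfolder at a time.
    // Assumption: zero-padded hour names, so name order is time order.
    static List<Path> filesInHourOrder(Path dayDir) throws IOException {
        List<Path> ordered = new ArrayList<>();
        for (Path hourDir : sortedChildren(dayDir)) {
            if (!Files.isDirectory(hourDir)) {
                continue; // skip stray files next to the hour folders
            }
            for (Path file : sortedChildren(hourDir)) {
                if (Files.isRegularFile(file)) {
                    ordered.add(file);
                }
            }
        }
        return ordered;
    }
}
```

A source would then need to read the files in exactly this order, finishing
everything in 00 before opening anything in 01.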

Does anyone have an idea how to do this?

Cheers,
Martin
