flink-user mailing list archives

From Tim Conrad <con...@math.fu-berlin.de>
Subject Best way to process data in many files? (FLINK-BATCH)
Date Tue, 23 Feb 2016 12:49:11 GMT

Dear FLINK community.

I was wondering what would be the recommended (best?) way to achieve 
some kind of file conversion that runs in parallel on all available 
Flink nodes, since the job is "embarrassingly parallel" (no 
dependencies between files).

Say I have an HDFS folder that contains multiple structured text files, 
each containing (x,y) pairs (think of CSV).

For each of these files I want to do the following (individually per file):
* Read file from HDFS
* Extract dataset(s) from file (e.g. list of (x,y) pairs)
* Apply some filter (e.g. smoothing)
* Do some pattern recognition on smoothed data
* Write results back to HDFS (different format)
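Just to make the per-file logic concrete, here is a minimal plain-Java sketch of the parsing and smoothing steps (the class and method names, the comma-separated format, and the window size are only illustrative, not from any library):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the per-file steps; all names here are made up.
public class FileConversionSketch {

    // Parse lines like "1.0,2.5" into (x, y) pairs.
    static List<double[]> parsePairs(List<String> lines) {
        List<double[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            pairs.add(new double[] {
                Double.parseDouble(parts[0].trim()),
                Double.parseDouble(parts[1].trim())
            });
        }
        return pairs;
    }

    // Simple moving-average smoothing of the y values (window of 3,
    // shrunk at the boundaries).
    static double[] smooth(List<double[]> pairs) {
        double[] smoothed = new double[pairs.size()];
        for (int i = 0; i < pairs.size(); i++) {
            int from = Math.max(0, i - 1);
            int to = Math.min(pairs.size() - 1, i + 1);
            double sum = 0;
            for (int j = from; j <= to; j++) {
                sum += pairs.get(j)[1];
            }
            smoothed[i] = sum / (to - from + 1);
        }
        return smoothed;
    }
}
```

The pattern-recognition step would then run on the smoothed values before writing results back to HDFS.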

Would the following be a good idea?

DataSource<String> fileList = ... // contains the list of file names in HDFS

// For each file name in the list do...
DataSet<FeatureList> featureList = fileList
        .flatMap(new ReadDataSetFromFile()) // flatMap because there might be multiple DataSets in a file
        .map(new Smoothing())
        .map(new FindPatterns());

featureList.writeAsFormattedText( ... );

However, I have the feeling that Flink does not distribute the 
independent tasks across the available nodes but executes everything on 
only one node.

