flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: How to make Flink read all files in HDFS folder and do transformations on the data
Date Sat, 07 May 2016 10:25:42 GMT
I had the same issue :)
I resolved it by reading all the file paths into a collection and then using this code:

env.fromCollection(filePaths).rebalance().map(file2pojo)

This gives you your DataSet of POJOs!

The rebalance() call is necessary to exploit parallelism; fromCollection()
creates a non-parallel source, so without it the whole pipeline would
execute with parallelism 1.
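
For reference, a minimal self-contained sketch of that pipeline (the Event
POJO, the Jackson-based parsing, and the path hdfs:///data/json-folder are
placeholders for illustration, not from your setup). Since each file holds
one JSON object per line, I use flatMap instead of map so a single file
path can emit many POJOs:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ReadJsonFolder {

    // Illustrative POJO for one JSON line; replace with your own schema.
    // Flink POJOs need a public no-arg constructor and public fields.
    public static class Event {
        public String id;
        public long timestamp;
        public Event() {}
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1) Collect all file paths in the folder (runs on the client).
        Path folder = new Path("hdfs:///data/json-folder"); // placeholder path
        FileSystem fs = folder.getFileSystem(new Configuration());
        List<String> filePaths = new ArrayList<>();
        for (FileStatus status : fs.listStatus(folder)) {
            if (status.isFile()) {
                filePaths.add(status.getPath().toString());
            }
        }

        // 2) Distribute the paths round-robin across all task slots,
        //    then parse every line of every file into a POJO.
        DataSet<Event> events = env
            .fromCollection(filePaths)
            .rebalance()
            .flatMap(new RichFlatMapFunction<String, Event>() {
                private transient ObjectMapper mapper;

                @Override
                public void open(org.apache.flink.configuration.Configuration parameters) {
                    mapper = new ObjectMapper(); // not serializable, so create per task
                }

                @Override
                public void flatMap(String path, Collector<Event> out) throws Exception {
                    Path file = new Path(path);
                    FileSystem taskFs = file.getFileSystem(new Configuration());
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(taskFs.open(file)))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            out.collect(mapper.readValue(line, Event.class));
                        }
                    }
                }
            });

        events.print(); // or continue with flatMap/groupBy/reduceGroup
    }
}

Note that listing the paths happens on the client; only the file contents
are read and parsed in parallel on the cluster.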

Best,
Flavio
On 7 May 2016 12:13, "Palle" <palle@sport.dk> wrote:

Hi there.

I've got an HDFS folder containing a lot of files. Each file contains many
JSON objects, one per line. I will have several TB in the HDFS folder.

My plan is to make Flink read all files and all JSON objects and then do
some analysis on the data, actually very similar to the
flatMap/groupBy/reduceGroup transformations that are done in the WordCount
example.

But I am a bit stuck, because I cannot seem to find out how to make Flink
read all files in an HDFS dir and then perform the transformations on the
data. I have googled quite a bit and also looked in the Flink API docs and
the mailing list history.

Can anyone point me to an example where Flink is used to read all files in
an HDFS folder and then do transformations on the data?

- and a second question: is there an elegant way to make Flink handle the
JSON objects? Can they be converted to POJOs by something similar to the
pojoType() method?

/Palle
