flink-user mailing list archives

From Alex Reid <alex.james.r...@gmail.com>
Subject Reading files from an S3 folder
Date Tue, 22 Nov 2016 18:42:18 GMT
Hi, I've been playing around with using apache flink to process some data,
and I'm starting out using the batch DataSet API.

To start, I read in some data from files in an S3 folder:

DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

Within the folder there are 20 gzipped files, and the job runs with 20
parallel tasks (parallelism 20). It looks like each task is reading in
ALL the files (the whole folder), but what I really want is for each
task to read in one file and process the data within the file it
read in.

Is this expected behavior? Am I supposed to be doing something
different here to get the results I want?
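For reference, a minimal sketch of what I'm running, assuming the Flink 1.x batch DataSet API (the class name, the downstream map, and the explicit setParallelism call are illustrative, not my actual job):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadS3Folder {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read every file in the folder. My expectation: one input split
        // per gzipped file (gzip is not splittable), so with parallelism 20
        // each parallel task would be assigned a single file, rather than
        // every task reading the whole folder.
        DataSet<String> records = env
                .readTextFile("s3://my-s3-bucket/some-folder/")
                .setParallelism(20);

        // Illustrative downstream processing on the lines read.
        records.map(String::toUpperCase).print();
    }
}
```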

