flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: Reading files from an S3 folder
Date Wed, 23 Nov 2016 16:45:46 GMT
This is not the expected behavior.
Each parallel instance should read only one file. The files should not be
read multiple times by the different parallel instances.
How did you determine that each node is reading all the data?


On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid <alex.james.reid@gmail.com> wrote:

> Hi, I've been playing around with using apache flink to process some data,
> and I'm starting out using the batch DataSet API.
> To start, I read in some data from files in an S3 folder:
> DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");
> Within the folder, there are 20 gzipped files, and I have 20 nodes/tasks running (so parallelism
> 20). It looks like each node is reading in ALL the files (the whole folder), but what I really
> want is for each node/task to read in one file each and process the data within the file
> it read in.
> Is this expected behavior? Am I supposed to be doing something different here to get the
> results I want?
> Thanks.
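For reference, the behavior Robert describes can be modeled roughly: gzipped files are not splittable, so the file input format produces one split per file, and splits are handed out across the parallel subtasks rather than duplicated to all of them. The toy sketch below (plain Java, not Flink's actual scheduler; class and method names are hypothetical) illustrates that with 20 files and parallelism 20, each subtask ends up with exactly one file:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitAssignmentSketch {

    // Toy model: one split per gzipped file (gzip is not splittable),
    // distributed across subtasks instead of being read by every subtask.
    public static Map<Integer, List<String>> assign(List<String> files, int parallelism) {
        Deque<String> splits = new ArrayDeque<>(files);
        Map<Integer, List<String>> perSubtask = new HashMap<>();
        int next = 0;
        while (!splits.isEmpty()) {
            perSubtask.computeIfAbsent(next % parallelism, t -> new ArrayList<>())
                      .add(splits.poll());
            next++;
        }
        return perSubtask;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            files.add("part-" + i + ".gz");
        }
        Map<Integer, List<String>> perSubtask = assign(files, 20);
        // With 20 files and parallelism 20, each subtask gets exactly one file.
        perSubtask.forEach((t, f) -> System.out.println("subtask " + t + " -> " + f));
    }
}
```

In the real runtime the assignment is lazy (idle subtasks request the next split), so the exact file-to-subtask mapping can differ from this round-robin sketch, but no file is read twice.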
