flink-user mailing list archives

From Alex Reid <alex.james.r...@gmail.com>
Subject Re: Reading files from an S3 folder
Date Wed, 23 Nov 2016 17:35:24 GMT
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over
300G total for all the files).

In the Web Client UI, when I look at the Plan and click on the subtask that reads in the
files, I see a line for each host, and the Bytes Sent for each host is around 350G.

The job takes longer than I'd expect, so I'm just trying to track down where the time is
being spent and whether it's doing what I'm expecting it to.
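In case it helps narrow that down, here is a minimal sketch (not something from this job;
the class name is just made up for illustration) of hanging a per-subtask accumulator off
the source, so each parallel instance reports how many records it actually saw instead of
relying on the Bytes Sent numbers in the UI:

import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Pass-through mapper that counts how many records each parallel subtask sees.
public class CountPerSubtask extends RichMapFunction<String, String> {

    private final LongCounter recordsSeen = new LongCounter();

    @Override
    public void open(Configuration parameters) {
        // One accumulator per subtask index, so the counts are not merged together.
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        getRuntimeContext().addAccumulator("records-subtask-" + subtask, recordsSeen);
    }

    @Override
    public String map(String value) {
        recordsSeen.add(1L);
        return value; // records are passed through unchanged
    }
}

Mapping the source through this (records.map(new CountPerSubtask())) puts the per-subtask
counts into the JobExecutionResult after the job finishes, which can be compared against
the record count of a single file.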

On Wed, Nov 23, 2016 at 8:45 AM, Robert Metzger <rmetzger@apache.org> wrote:

> Hi,
> This is not the expected behavior.
> Each parallel instance should read only one file. The files should not be
> read multiple times by the different parallel instances.
> How did you check / find out that each node is reading all the data?
>
> Regards,
> Robert
>
> On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid <alex.james.reid@gmail.com>
> wrote:
>
>> Hi, I've been playing around with using apache flink to process some
>> data, and I'm starting out using the batch DataSet API.
>>
>> To start, I read in some data from files in an S3 folder:
>>
>> DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");
>>
>>
>> Within the folder, there are 20 gzipped files, and I have 20 nodes/tasks running (so
>> parallelism 20). It looks like each node is reading in ALL the files (the whole folder),
>> but what I really want is for each node/task to read in one file and process the data
>> within the file it read in.
>>
>> Is this expected behavior? Am I supposed to be doing something different here to get
>> the results I want?
>>
>> Thanks.
>>
>>
>
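As a reference point, here is a minimal sketch of the setup described above with the source
parallelism pinned to the number of files (the bucket path is the placeholder from the
original mail and the count() call is just a stand-in for the real processing). Each .gz
file is read as a single, unsplittable split, so with 20 files and source parallelism 20
each instance should end up with exactly one file:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadS3FolderJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One source instance per gzipped file when parallelism matches the file count.
        DataSet<String> records = env
                .readTextFile("s3://my-s3-bucket/some-folder/")
                .setParallelism(20);

        // Stand-in for the real processing: just count the lines.
        System.out.println("total records: " + records.count());
    }
}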
