flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Parallel file read in LocalEnvironment
Date Wed, 07 Oct 2015 14:26:06 GMT
I've tried splitting my huge file by line count (using the bash command
split -l) in 2 different ways:

   1. small line count (a huge number of small files)
   2. big line count (a small number of big files)

I can't understand why the time required to actually start the job is
more or less the same in both cases:

   - in 1. it takes a long time to fetch the file list (~50,000 files), while
   the split assigner is fast at assigning each split (but there are a lot of them)
   - in 2. Flink is fast at fetching the file list but extremely slow at
   generating the splits to assign

Initially I thought that Flink was eagerly materializing the lines
somewhere, but neither memory nor disk usage increases.
What is going on underneath? Is this normal?

Thanks in advance,
Flavio



On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <sewen@apache.org> wrote:

> The split functionality is in the FileInputFormat and the functionality
> that takes care of lines across splits is in the DelimitedInputFormat.
>
> On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> I'm sorry there is no such documentation.
>> You need to look at the code :-(
>>
>> 2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> And what is the split policy for the FileInputFormat? Does it depend on the
>>> fs block size?
>>> Is there a pointer to the various Flink input formats and a description
>>> of their internals?
>>>
>>> On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> it is not possible to split by line count because that would mean
>>>> reading and parsing the file just for splitting.
>>>>
>>>> Parallel processing of data sources depends on the input splits created
>>>> by the InputFormat. Local files can be split just like files in HDFS.
>>>> Usually, each file corresponds to at least one split but multiple files
>>>> could also be put into a single split if necessary. The logic for that would
>>>> go into the InputFormat.createInputSplits() method.
>>>>
>>>> Cheers, Fabian
>>>>
>>>> 2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>>
>>>>> Hi to all,
>>>>>
>>>>> is there a way to split a single local file by line count (e.g. a
>>>>> split every 100 lines) in a LocalEnvironment to speed up a simple map
>>>>> function? It is not very clear to me how local files (the files in a
>>>>> directory if recursive=true) are managed by Flink. Is there any reference
>>>>> for these internals?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
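
To illustrate the two points made in the thread (splits are planned without
parsing the file, and a delimiter-aware reader takes care of lines that cross
split boundaries), here is a minimal, self-contained Java sketch. It does not
use Flink's classes and is not Flink's implementation: ByteRangeSplits, Split,
createSplits and readLines are hypothetical names, and the rule "a line
belongs to the split that contains its first byte" is an assumption chosen
for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not Flink API.
final class ByteRangeSplits {

    /** A byte range [start, start + length) of one file. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Plans splits from the file size alone; the file content is never read. */
    static List<Split> createSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long pos = 0; pos < fileLength; pos += splitSize) {
            splits.add(new Split(pos, Math.min(splitSize, fileLength - pos)));
        }
        return splits;
    }

    /** Reads the '\n'-delimited lines whose first byte lies inside the split. */
    static List<String> readLines(Path file, Split split) throws IOException {
        List<String> lines = new ArrayList<>();
        long end = split.start + split.length;
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            if (split.start > 0) {
                // Start one byte early and skip to just past the next newline:
                // a line that began in the previous split belongs to that split.
                in.seek(split.start - 1);
                int b;
                do {
                    b = in.read();
                } while (b != -1 && b != '\n');
            }
            // A line that starts before 'end' is read completely,
            // even if its last bytes lie in the next split.
            while (in.getFilePointer() < end) {
                lines.add(readLine(in));
            }
        }
        return lines;
    }

    /** Reads bytes up to (and consuming) the next '\n' or end of file. */
    private static String readLine(RandomAccessFile in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("split-demo", ".txt");
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            content.append("line-").append(i).append('\n');
        }
        Files.write(file, content.toString().getBytes(StandardCharsets.UTF_8));

        // Splits are 16 bytes wide, so most lines straddle a boundary.
        for (Split s : createSplits(Files.size(file), 16)) {
            System.out.println("split@" + s.start + "+" + s.length + " -> " + readLines(file, s));
        }
        Files.delete(file);
    }
}
```

In this sketch, planning the splits only needs the file length, whereas
splitting every N lines would require scanning the whole file for delimiters
first, which is the cost Fabian describes. Each line is returned by exactly
one split, so the splits can be read in parallel without coordination.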
