flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Parallel file read in LocalEnvironment
Date Wed, 07 Oct 2015 13:27:19 GMT
The split functionality is in the FileInputFormat, and the functionality
that takes care of lines spanning splits is in the DelimitedInputFormat.
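
To illustrate the idea of byte-range splits with lines that cross split boundaries, here is a minimal sketch in plain Java. It is not Flink's actual code: the class SplitSketch and the methods createSplits and readSplit are hypothetical names, an in-memory byte array stands in for the file, and the delimiter is assumed to be a single '\n' byte. The convention assumed here is that a line belongs to the split in which its first byte lies, so a reader skips a leading partial line and may read past the end of its own range to finish its last line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

    /** Cuts [0, length) into numSplits contiguous byte ranges, returned as {start, end} pairs. */
    static List<long[]> createSplits(long length, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long base = length / numSplits;
        long rest = length % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // The first 'rest' splits get one extra byte so the ranges cover the whole file.
            long size = base + (i < rest ? 1 : 0);
            splits.add(new long[] {start, start + size});
            start += size;
        }
        return splits;
    }

    /** Returns the lines whose first byte lies in [start, end). */
    static List<String> readSplit(byte[] data, long start, long end) {
        int pos = (int) start;
        if (start > 0) {
            // Skip a partial line at the beginning: the previous split's reader finishes it.
            // A line begins at pos exactly when the byte before pos is the delimiter.
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            // Read to the next delimiter, even if that lies beyond 'end'.
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }
}
```

Calling readSplit once for each range returned by createSplits and concatenating the results should yield every line of the input exactly once, regardless of where the byte boundaries fall. This is also why no line counting is needed up front: only the file length has to be known to compute the ranges.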

On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhueske@gmail.com> wrote:

> I'm sorry, there is no such documentation.
> You need to look at the code :-(
>
> 2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>
>> And what is the split policy for the FileInputFormat? Does it depend on
>> the fs block size?
>> Is there a pointer to the various Flink input formats and a description
>> of their internals?
>>
>> On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>>
>>> Hi Flavio,
>>>
>>> it is not possible to split by line count because that would mean
>>> reading and parsing the file just for splitting.
>>>
>>> Parallel processing of data sources depends on the input splits created
>>> by the InputFormat. Local files can be split just like files in HDFS.
>>> Usually, each file corresponds to at least one split, but multiple files
>>> could also be put into a single split if necessary. The logic for that
>>> would go into the InputFormat.createInputSplits() method.
>>>
>>> Cheers, Fabian
>>>
>>> 2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>
>>>> Hi to all,
>>>>
>>>> is there a way to split a single local file by line count (e.g. a split
>>>> every 100 lines) in a LocalEnvironment to speed up a simple map function?
>>>> It is not very clear to me how local files (or the files in a directory
>>>> if recursive=true) are managed by Flink. Is there any reference for
>>>> these internals?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>
>>>
>>
>>
>
