Subject: Re: Parallel file read in LocalEnvironment
From: Stephan Ewen
To: user@flink.apache.org
Date: Wed, 18 Nov 2015 17:52:28 +0100

Late answer, sorry:

The splits are created in the JobManager, so the job submission should not be affected by that.

The assignment of splits to workers is very fast, so many splits with small amounts of data are not very different from few splits with large amounts of data.

Lines are never materialized, and the operators do not work differently based on different numbers of splits.
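
If the goal is simply to speed up a map over one large local file, raising the parallelism and letting the input format create byte-range splits should be enough. A minimal sketch against the Java DataSet API (the path and class name here are made up for illustration):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ParallelLocalRead {

    public static void main(String[] args) throws Exception {
        // Parallelism 8: the source asks the input format for at least
        // 8 splits, so even a single large file is cut into 8 byte ranges
        // that the map below processes in parallel.
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(8);

        env.readTextFile("file:///path/to/huge-file.txt")  // hypothetical path
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String line) {
                   return line.trim();
               }
           })
           .print();
    }
}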

On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
I've tried to split my huge file by line count (using the bash command split -l) in 2 different ways:
  1. small line count (huge number of small files)
  2. big line count (small number of big files)
I can't understand why the time required to effectively start the job is more or less the same in both cases:
  - in 1. it takes a long time to fetch the file list (~50,000 files), and the split assigner is fast to assign the splits (but even though it is fast, there are a lot of them)
  - in 2. Flink is fast to fetch the file list, but it is extremely slow to generate the splits to assign
Initially I thought that Flink was eagerly materializing the lines somewhere, but neither memory nor disk usage increases.
What is going on underneath? Is this normal?

Thanks in advance,
Flavio



On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <sewen@apache.org> wrote:
The split functionality is in the FileInputFormat, and the functionality that takes care of lines across splits is in the DelimitedInputFormat.
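
For intuition, here is a self-contained sketch (plain Java, not Flink's actual code) of the rule such delimited formats follow so that byte-range splits neither lose nor duplicate lines:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {

    // Returns the lines "owned" by the byte range [start, start + length).
    static List<String> readLinesOfSplit(String path, long start, long length)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long end = start + length;
            if (start == 0) {
                file.seek(0);
            } else {
                // Back up one byte and discard everything up to the next
                // newline: the first partial line belongs to the previous
                // split, which reads across its boundary to finish it.
                file.seek(start - 1);
                file.readLine();
            }
            // A line is ours if it starts before 'end', even when it
            // extends past the boundary into the next split's range.
            String line;
            while (file.getFilePointer() < end && (line = file.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}

The real DelimitedInputFormat differs in the details (buffered reads, configurable delimiters), but the ownership rule is the same: a record belongs to the split in which it starts.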

On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhueske@gmail.com> wrote:
I'm sorry, there is no such documentation.
You need to look at the code :-(

2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
And what is the split policy of the FileInputFormat? Does it depend on the fs block size?
Is there a pointer to the several Flink input formats and a description of their internals?

On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhueske@gmail.com> wrote:
Hi Flavio,

it is not possible to split by line count because that would mean reading and parsing the file just for splitting.

Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds to at least one split, but multiple files could also be put into a single split if necessary. The logic for that would go into the InputFormat.createInputSplits() method.
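
A rough skeleton of where that logic would live; the class and method names follow the Java API, but treat the signatures as approximate:

import java.io.IOException;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Hypothetical subclass: the hook where a custom split policy would go.
public class CustomSplitTextFormat extends TextInputFormat {

    public CustomSplitTextFormat(Path filePath) {
        super(filePath);
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        // Default behavior: byte-range splits per file, at least
        // minNumSplits in total.
        FileInputSplit[] splits = super.createInputSplits(minNumSplits);
        // A custom policy (e.g. grouping many small files into one split)
        // would replace or post-process 'splits' here.
        return splits;
    }
}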

Cheers, Fabian

2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
Hi to all,


is there a way to split a single local file by line count (e.g. a split every 100 lines) in a LocalEnvironment to speed up a simple map function? It is not very clear to me how local files (the files in a directory if recursive=true) are managed by Flink. Is there any reference to these internals?

Best,
Flavio
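
For the recursive=true part of the question: nested files can be enumerated by setting a parameter on the input source. A minimal sketch (DataSet API; the directory path is hypothetical):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveLocalRead {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(4);

        // Tell the file input format to descend into sub-directories.
        Configuration params = new Configuration();
        params.setBoolean("recursive.file.enumeration", true);

        DataSet<String> lines = env
                .readTextFile("file:///path/to/dir")  // hypothetical directory
                .withParameters(params);

        lines.print();
    }
}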


