Subject: Re: hdfs block size cont.
From: Lior Schachter
To: hdfs-user@hadoop.apache.org
Date: Thu, 17 Mar 2011 16:21:16 +0200
Currently each gzip file is about 250 MB (x60 files = ~15 GB), so we have 256 MB blocks.

However, I understand that smaller files/blocks are better for utilizing M/R parallel processing.

So maybe having 128 MB gzip files with a corresponding 128 MB block size would be better?
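For reference, a per-file block size can be set at write time instead of
changing the cluster-wide default. Below is a rough, untested sketch against
the FileSystem API; the paths, file names and the 128 MB figure are only
illustrative assumptions:

    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class PutWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination; block size chosen to match the gzip size.
        Path dst = new Path("/data/input/part-0001.gz");
        long blockSize = 128L * 1024 * 1024;                 // 128 MB
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out =
            fs.create(dst, true, bufferSize, replication, blockSize);
        FileInputStream in = new FileInputStream("part-0001.gz");
        IOUtils.copyBytes(in, out, conf, true);              // closes both streams
      }
    }

The shell equivalent should be something along the lines of
"hadoop fs -D dfs.block.size=134217728 -put part-0001.gz /data/input/",
though the property name may differ between Hadoop versions.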


On Thu, Mar 17, 2011 at 4:05 PM, Harsh J <qwertymaniac@gmail.com> wrote:
> On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <liors@infolinks.com> wrote:
> > Hi,
> > If I have big gzip files (>> block size), will M/R split a single
> > file into multiple blocks and send them to different mappers?
> > The behavior I currently see is that a map task is still opened per file
> > (and not per block).

> Yes, this is true. This is the current behavior with GZip files (since
> they can't be split and decompressed from an arbitrary offset). I had
> somehow managed to ignore the GZIP part of your question in the previous
> thread!

> But still, ~60 files worth 15 GB in total would mean at least 3 GB per
> file. And seeing how they can't really be split right now, it would be
> good to have each of them use up only a single block. Perhaps for these
> files alone you may use a block size of 3-4 GB, thereby making these
> file reads more local for your record readers?

> In the future, though, HADOOP-7076 plans to add a pseudo-splitting
> approach for plain GZIP files. 'Concatenated' GZIP files could be split
> (HADOOP-6835) across mappers as well (as demonstrated in PIG-42).

> --
> Harsh J
> http://harshj.com
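To make the one-map-per-.gz behaviour described above concrete: the stock
FileInputFormat subclasses treat any file whose suffix maps to a compression
codec as non-splittable, so the whole file becomes one split and one map
task. A rough, untested sketch of that check (the input path is a made-up
example; newer releases additionally allow splittable codecs such as bzip2):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class SplitCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory resolves a codec from the file suffix (.gz -> GzipCodec).
        Path p = new Path("/data/input/part-0001.gz");
        CompressionCodec codec = factory.getCodec(p);

        // A recognized (non-splittable) codec means the record reader must
        // start at byte 0, so the whole file is handed to a single mapper.
        boolean oneMapPerFile = (codec != null);
        System.out.println(p + " -> " + (oneMapPerFile
            ? "single split: one mapper reads the whole file"
            : "splittable along block boundaries"));
      }
    }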

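On the 'concatenated' GZIP point above: a concatenated gzip file is simply
several complete gzip members appended back to back (what "cat a.gz b.gz >
c.gz" would produce), and the result is still a valid gzip file. A small
JDK-only sketch, with made-up file names and contents, that writes such a
file:

    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;

    public class ConcatGzip {
      public static void main(String[] args) throws Exception {
        // Two complete gzip members written back to back into one file.
        FileOutputStream raw = new FileOutputStream("concat.gz");
        for (String chunk : new String[] {"first part\n", "second part\n"}) {
          GZIPOutputStream member = new GZIPOutputStream(raw);
          member.write(chunk.getBytes("UTF-8"));
          member.finish();   // ends this member without closing 'raw'
        }
        raw.close();
        // gunzip (and any multi-member-aware reader) decompresses both
        // members in sequence; this is the kind of file the HADOOP-6835
        // work mentioned above deals with.
      }
    }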