From: Raj Vishwanathan <rajvish@yahoo.com>
To: user@hadoop.apache.org
Cc: annalahoud@gmail.com
Date: Tue, 9 Oct 2012 09:09:55 -0700 (PDT)
Subject: Re: File block size use

Anna,

I misunderstood your problem. I thought you wanted to change the block size of every file. I didn't realize that you were aggregating multiple small files into a different, albeit smaller, set of larger files with a bigger block size to improve performance.

I think, as Chris suggested, you need a custom M/R job, or you could probably get away with some scripting magic :-)

Raj

________________________________
From: Anna Lahoud <annalahoud@gmail.com>
To: user@hadoop.apache.org; Raj Vishwanathan <rajvish@yahoo.com>
Sent: Tuesday, October 9, 2012 7:01 AM
Subject: Re: File block size use

Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <rajvish@yahoo.com> wrote:

I haven't tried it, but this should also work:

  hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest

Raj
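
A rough sketch of the same per-file idea through the FileSystem API (paths are placeholders and the 512 MB figure is only illustrative; like the -cp approach, this changes the block size of the copy but does not merge small files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.InputStream;
    import java.io.OutputStream;

    public class CopyWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path(args[0]);            // existing file
        Path dst = new Path(args[1]);            // copy written with the new block size
        long newBlockSize = 512L * 1024 * 1024;  // illustrative 512 MB

        InputStream in = fs.open(src);
        // create() lets the caller pick the block size of the new file explicitly
        OutputStream out = fs.create(dst, true,
            conf.getInt("io.file.buffer.size", 4096),
            fs.getFileStatus(src).getReplication(),
            newBlockSize);
        IOUtils.copyBytes(in, out, conf, true);  // close=true closes both streams
      }
    }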

________________________________
From: Anna Lahoud <annalahoud@gmail.com>
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com
Sent: Tuesday, October 2, 2012 7:17 AM
Subject: Re: File block size use

Thank you. I will try today.

On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.hadoop@gmail.com> wrote:

Hi Anna,

If you want to increase the block size of existing files, you can use an identity mapper with no reducer. Set the minimum and maximum split sizes to your requirement (512 MB), and use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done.

Regards,
Bejoy KS

Sent from handheld, please excuse typos.
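
A sketch of the kind of job described above, assuming a Hadoop 2.x-style mapreduce API and, purely for illustration, sequence files with Text keys and Text values (adjust the classes to whatever your files actually contain):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class RewriteSequenceFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "rewrite sequence files");
        job.setJarByClass(RewriteSequenceFiles.class);

        // Identity map (the base Mapper passes records through) and no reduce phase
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);    // match your files' key class
        job.setOutputValueClass(Text.class);  // match your files' value class

        // Ask for ~512 MB splits, per the suggestion above
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

One caveat: the stock FileInputFormat creates at least one split per input file and does not combine small files into a single split, so a map-only job like this mainly helps when individual input files are already larger than the requested split size.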

________________________________
From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Mon, 1 Oct 2012 21:12:58 -0700
Reply-To: user@hadoop.apache.org
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with the key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set though.)
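
In the newer mapreduce API, the reducer described here might look roughly like this (a sketch; the class name is made up, and the stock Mapper serves as the identity map):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Passes every value through but replaces the key (the file offset produced
    // by TextInputFormat) with NullWritable, so the offsets never reach the output.
    public class DiscardKeyReducer
        extends Reducer<LongWritable, Text, NullWritable, Text> {
      @Override
      protected void reduce(LongWritable key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          context.write(NullWritable.get(), value);
        }
      }
    }

The driver would set the map output classes to LongWritable and Text, the job output classes to NullWritable and Text, and pick a fixed number of reducers based on the expected input size.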

A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
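
A sketch of that variant (class names are illustrative): the map emits NullWritable keys directly, and a custom Partitioner scatters records across reducers at random:

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Map side: drop the file-offset key immediately instead of carrying it
    // through the shuffle.
    public class NullKeyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        context.write(NullWritable.get(), line);
      }
    }

    // Every record now has the same (null) key, so a hash partitioner would send
    // everything to one reducer; partition randomly instead.
    class RandomPartitioner extends Partitioner<NullWritable, Text> {
      private final Random random = new Random();
      @Override
      public int getPartition(NullWritable key, Text value, int numPartitions) {
        return random.nextInt(numPartitions);
      }
    }

The driver would register the partitioner with job.setPartitionerClass(RandomPartitioner.class); the reduce side can then be the stock identity Reducer.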

To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though.
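
For example, the driver could derive the reducer count from the summed input sizes along these lines (the helper name and the 512 MB target are only illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReducerCountHelper {
      // Sum the sizes of the files directly under inputDir and return how many
      // reducers are needed so that each writes roughly one target-sized file.
      public static int reducersForTargetSize(Configuration conf, Path inputDir,
          long targetBytesPerFile) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        long totalBytes = 0L;
        for (FileStatus status : fs.listStatus(inputDir)) {
          if (!status.isDir()) {
            totalBytes += status.getLen();
          }
        }
        return (int) Math.max(1, (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile);
      }
    }

The job would then call job.setNumReduceTasks(reducersForTargetSize(conf, inputPath, 512L * 1024 * 1024)) before submission.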

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalahoud@gmail.com> wrote:

I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.

I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes.

I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs.

What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.

Thank you.

Anna
