From: Anna Lahoud <annalahoud@gmail.com>
To: Raj Vishwanathan <rajvish@yahoo.com>
Cc: user@hadoop.apache.org
Date: Tue, 9 Oct 2012 12:28:14 -0400
Subject: Re: File block size use

You are correct that I want to create a small number of large files from a
large number of small files. The only solution that has worked, as you say,
has been a custom M/R job. Thank you for the help and ideas.
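For anyone who finds this thread in the archives: the working job follows the
approach Chris describes in the quoted thread below (identity map, reducer
that drops the keys). The sketch here is only an illustration, not the exact
code; it assumes Text keys and values in the sequence files and takes the
reducer count as a command-line argument, so adjust both for your own data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ConsolidateSeqFiles {

  // Reducer that discards the original keys and writes every value under a
  // NullWritable key. Each reduce task produces one larger sequence file.
  public static class NullKeyReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "consolidate sequence files");
    job.setJarByClass(ConsolidateSeqFiles.class);

    // No mapper class set: the default Mapper passes keys and values through
    // unchanged, which is the "identity mapper" part of the approach.
    job.setReducerClass(NullKeyReducer.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // One output file per reducer; pick this to hit the target file size.
    job.setNumReduceTasks(Integer.parseInt(args[2]));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Invocation would be something like:
hadoop jar consolidate.jar ConsolidateSeqFiles <input dir> <output dir> <num reducers>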
On Tue, Oct 9, 2012 at 12:09 PM, Raj Vishwanathan wrote:

> Anna
>
> I misunderstood your problem. I thought you wanted to change the block
> size of every file. I didn't realize that you were aggregating multiple
> small files into a different, albeit smaller, set of larger files with a
> bigger block size to improve performance.
>
> I think, as Chris suggested, you need a custom M/R job, or you could
> probably get away with some scripting magic :-)
>
> Raj
>
> ------------------------------
> *From:* Anna Lahoud <annalahoud@gmail.com>
> *To:* user@hadoop.apache.org; Raj Vishwanathan <rajvish@yahoo.com>
> *Sent:* Tuesday, October 9, 2012 7:01 AM
> *Subject:* Re: File block size use
>
> Raj - I was not able to get this to work either.
>
> On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan wrote:
>
> I haven't tried it, but this should also work:
>
> hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
>
> Raj
>
> ------------------------------
> *From:* Anna Lahoud <annalahoud@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
> *Subject:* Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS wrote:
>
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB), and use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job. Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
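(Adding a note inline for the archives: as I read Bejoy's suggestion, the
driver would be configured roughly as in the sketch below. It is only a
sketch and again assumes Text keys and values. One caveat: a plain
FileInputFormat split never spans more than one input file, so with many
small input files this can still leave roughly one output file per input.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapOnlyResize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "map-only sequence file resize");
    job.setJarByClass(MapOnlyResize.class);

    // No mapper or reducer classes set: the default Mapper is the identity,
    // and zero reduce tasks means the map output is written out directly.
    job.setNumReduceTasks(0);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Ask for roughly 512 MB per split, so each map task (and therefore each
    // output file) covers about that much input.
    long targetSplitSize = 512L * 1024 * 1024;
    FileInputFormat.setMinInputSplitSize(job, targetSplitSize);
    FileInputFormat.setMaxInputSplitSize(job, targetSplitSize);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}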
> ------------------------------
> *From:* Chris Nauroth <cnauroth@hortonworks.com>
> *Date:* Mon, 1 Oct 2012 21:12:58 -0700
> *Reply-To:* user@hadoop.apache.org
> *Subject:* Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size. Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, each reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation. In my case, I was typically
> working with TextInputFormat (not sequence files). I used IdentityMapper
> and a custom reducer that passed through all values but with the key set
> to NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data. For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results. (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final
> output. Also, the distribution of input records to reduce tasks is not
> truly random, and therefore the reduce output files may be uneven in size.
> This could be solved by writing NullWritable keys out of the map task
> instead of the reduce task and writing a custom implementation of
> Partitioner to distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size). I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that comes closer to a solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512 MB block size, there is nothing that splits the
> output, and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes
> into the file, but that seems like the long version. I am hoping there is
> a shorter path.
>
> Thank you.
>
> Anna
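(One last note for the archives: Chris's idea above, summing
FileStatus.getLen() over the inputs to pick the reducer count, addresses the
"how many reducers do I need" question from my original message. A rough
sketch of that calculation, with the 512 MB target size and a single flat
input directory as assumptions:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReducerCountEstimate {

  // Sum the sizes of the files in the input directory and divide by the
  // desired output file size to get a reducer count (one output per reducer).
  public static int estimateReducers(Configuration conf, Path inputDir,
                                     long targetBytesPerFile) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    long totalBytes = 0L;
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (!status.isDir()) {
        totalBytes += status.getLen();
      }
    }
    long reducers = (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile;
    return (int) Math.max(1, reducers);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    long target = 512L * 1024 * 1024;  // aim for roughly 512 MB output files
    int reducers = estimateReducers(conf, new Path(args[0]), target);
    System.out.println("numReduceTasks = " + reducers);
    // The value would then be passed to job.setNumReduceTasks(reducers).
  }
}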
