Subject: Re: how to get output files of fixed size in map-reduce job output
From: Mapred Learn <mapred.learn@gmail.com>
To: harsh@cloudera.com, mapreduce-user@hadoop.apache.org
Date: Wed, 22 Jun 2011 11:57:47 -0700

The problem with the first option is that even if the file is uploaded as
1 GB, the output is still not 1 GB (it would depend on compression). So some
trial runs need to be done to estimate what input file size yields a 1 GB
output.
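
To be concrete about the estimating, something like this is what I mean
(only a sketch: the class name is made up, and the ~0.39 ratio is
hypothetical; it would come from whatever a trial run actually shows).
There is a fuller driver sketch in the PS at the bottom of this mail.

  import org.apache.hadoop.mapred.JobConf;

  public class SplitSizeEstimate {
    public static void main(String[] args) {
      // Hypothetical trial run: 1 GB of input text came out as ~400 MB
      // of compressed sequence-file output, i.e. a ratio of ~0.39.
      double observedRatio = 400.0 / 1024.0;     // output bytes per input byte
      long targetOutput = 1024L * 1024L * 1024L; // want ~1 GB per output file

      // Input bytes each mapper should consume to emit ~1 GB of output.
      long splitBytes = (long) (targetOutput / observedRatio);

      JobConf conf = new JobConf(SplitSizeEstimate.class);
      conf.setLong("mapred.min.split.size", splitBytes);
      System.out.println("mapred.min.split.size = " + splitBytes);
    }
  }
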
For block size, I got your point. I think I said the same thing in terms of
file splits.
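
And just to check that I follow the per-file block size idea, I guess
loading the file could look roughly like this (an untested sketch; the
class name and paths are made up):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class UploadWithBlockSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Block size is a per-file property, so it can be set at create
      // time without touching the cluster-wide default.
      long oneGb = 1024L * 1024L * 1024L;
      OutputStream out = fs.create(new Path("/data/input.txt"), true,
          conf.getInt("io.file.buffer.size", 4096),
          (short) 3,  // replication factor
          oneGb);     // per-file block size
      InputStream in = new FileInputStream("input.txt");
      IOUtils.copyBytes(in, out, conf, true); // closes both streams
    }
  }

I believe the shell equivalent would be something like
"hadoop fs -Ddfs.block.size=1073741824 -put input.txt /data/input.txt",
though I have not verified that.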

On Wed, Jun 22, 2011 at 11:46 AM, Harsh J <harsh@cloudera.com> wrote:
> CombineFileInputFormat should help with doing some locality, but it
> would not be as perfect as having the file loaded into HDFS itself
> with a 1 GB block size (block sizes are per-file properties, not
> global ones). You may consider that as an alternative approach.
>
> I do not get (ii). I meant by my last sentence the same thing I've
> explained just above here. If your block size is 64 MB, and you
> request splits of 1 GB (via plain FileInputFormat), then even the 64
> MB read can't be guaranteed local (theoretically speaking).
>
> On Thu, Jun 23, 2011 at 12:04 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
> > Hi Harsh,
> > Thanks!
> > i) I am currently doing it by extending CombineFileInputFormat and
> > specifying -Dmapred.max.split.size, but this increases job finish time by
> > about 3 times.
> > ii) Since you said the output file size is going to be greater than the
> > block size in this case: what happens when someone has an input split of,
> > say, 1 GB and the map-reduce output produced is 400 MB? In that case too,
> > is the size greater than the block size? Or did you mean that since the
> > mapper will get multiple input files as its input split, the data input
> > to the mapper won't be local?
> >
> > On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <harsh@cloudera.com> wrote:
> >> Mapred,
> >>
> >> This should be doable if you are using TextInputFormat (or other
> >> FileInputFormat derivatives that do not override getSplits()
> >> behavior).
> >>
> >> Try this:
> >> jobConf.setLong("mapred.min.split.size", <byte size you want each
> >> mapper's split to try to contain, i.e. 1 GB in bytes (long)>);
> >>
> >> This would get you splits worth the size you mention, 1 GB or
> >> thereabouts, and you should have outputs fairly near to 1 GB when you
> >> do the sequence file conversion (lower at times due to serialization
> >> and compression being applied). You can play around with the parameter
> >> until the results are satisfactory.
> >>
> >> Note: Tasks would no longer be perfectly data local, since you're
> >> requesting much more than the block size, perhaps.
> >>
> >> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.learn@gmail.com> wrote:
> >> > I have a use case where I want to process data and generate seq file
> >> > output of fixed size, say 1 GB, i.e. each map-reduce job output should
> >> > be 1 GB.
> >> >
> >> > Does anybody know of any -D option or any other way to achieve this?
> >> >
> >> > -Thanks JJ
> >>
> >> --
> >> Harsh J
>
> --
> Harsh J
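
PS: To make sure I have the "Try this" suggestion from your earlier mail
right, here is the minimal driver I plan to try. It is only a sketch: the
class and path names are made up, and I would replace the hard-coded 1 GB
with the split size estimated from trial runs (see the snippet at the top
of this mail).

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class FixedSizeSeqFileJob {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(FixedSizeSeqFileJob.class);
      job.setJobName("text-to-seqfile");

      job.setInputFormat(TextInputFormat.class);
      FileInputFormat.setInputPaths(job, new Path("/data/input"));

      // Ask for ~1 GB splits so each mapper consumes ~1 GB of input
      // (the compressed output per file will be correspondingly smaller).
      job.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);

      // Map-only identity job: one sequence file per input split.
      job.setMapperClass(IdentityMapper.class);
      job.setNumReduceTasks(0);

      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      job.setOutputFormat(SequenceFileOutputFormat.class);
      SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
      FileOutputFormat.setOutputPath(job, new Path("/data/seq-output"));

      JobClient.runJob(job);
    }
  }

I made it map-only so that each input split maps straight to one output
sequence file; with reducers, the number and size of output files would
instead follow the number of reduce tasks.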
