Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of sandy.ryza@cloudera.com
 designates 209.85.220.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAAu13zGXGY_xTxQ=GBjsfP87ZWpoTiDTUKVOiGLBxEOHHDkMyg@mail.gmail.com>
References: 
 <CAMy=5ocNYAaQ3BuKvT5Ejye9yGaFb5TEbg-ymuyB_fvBVwisfQ@mail.gmail.com>
	<CAAu13zH3d2=DYW-NKotkEoGxJAp63E34FOuzgpKaZBiDH2WKsQ@mail.gmail.com>
	<BLU0-SMTP314DB57FDADF96540F1BCF98F0E0@phx.gbl>
	<CAAu13zEOp=Ngc=WFSpvmLm2C3AdsN83JEcXcZ98ZW9OxLBoY-A@mail.gmail.com>
	<BLU0-SMTP6DAB522991C23A008ED368F0E0@phx.gbl>
	<CAAu13zGXGY_xTxQ=GBjsfP87ZWpoTiDTUKVOiGLBxEOHHDkMyg@mail.gmail.com>
Date: Fri, 15 Feb 2013 13:07:16 -0800
Message-ID: 
 <CACBYxK+Q8wWc3QU6PSKhGkm8-6+RshAKbqKvsd3aOGS30EGLmQ@mail.gmail.com>
Subject: Re: Sorting huge text files in Hadoop
From: Sandy Ryza <sandy.ryza@cloudera.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=14dae9cdc48793770b04d5c9c2d4

--14dae9cdc48793770b04d5c9c2d4
Content-Type: text/plain; charset=ISO-8859-1

A map-only job does not result in the standard shuffle-sort.  Map outputs
are written directly to HDFS.
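
To make that concrete, here is a minimal sketch in plain Python. It is not
Hadoop code: the function names, the first-letter partitioner, and the tiny
splits are all made up for illustration, and keys are assumed to start with a
lowercase a-z letter. It mimics a map-only job (in Hadoop, a job configured
with job.setNumReduceTasks(0)) and contrasts it with a job that has reducers
and a range partitioner like the one described below.

```python
from collections import defaultdict

def identity_map(line):
    # Identity mapper: emit the line unchanged as its own key.
    return line

def map_only_job(splits):
    """Simulate zero reducers: one output file per split, records in input order."""
    return [[identity_map(line) for line in split] for split in splits]

def range_partition(key, num_reducers):
    # Hypothetical partitioner: bucket by first letter (assumes lowercase a-z).
    return (ord(key[0]) - ord("a")) * num_reducers // 26

def job_with_reducers(splits, num_reducers):
    """Simulate shuffle-sort: route each key to a reducer, then sort per reducer."""
    buckets = defaultdict(list)
    for split in splits:
        for line in split:
            key = identity_map(line)
            buckets[range_partition(key, num_reducers)].append(key)
    return [sorted(buckets[r]) for r in range(num_reducers)]

splits = [["pear", "apple", "zebra"], ["mango", "banana", "kiwi"]]
map_only_parts = map_only_job(splits)        # order of the input is preserved
sorted_parts = job_with_reducers(splits, 2)  # each part sorted, parts in key order
```

In this sketch, concatenating sorted_parts in order gives a globally sorted
list, while map_only_parts stays in whatever order the input had.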

-Sandy

On Fri, Feb 15, 2013 at 12:23 PM, Jay Vyas <jayunit100@gmail.com> wrote:

> Maybe I'm mistaken about what is meant by map-only.  Does a map-only job
> still result in the standard shuffle-sort?  Or does that get cut short?
>
> Hmm, I think I see what you mean. I guess a map-only sort is possible as
> long as you use a custom partitioner and you let the shuffle/sort run to
> completion.
>
> I think the shuffle/sort, if you use a partitioner that partitions the
> keys in order (i.e. part-0 is all lines starting with "a", part-1 is all
> starting with "b", etc.),
> does still run in spite of the fact that you're not running reducers.
>
> On Fri, Feb 15, 2013 at 3:09 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> Why do you need a 1TB block?
>>
>> On Feb 15, 2013, at 1:29 PM, Jay Vyas <jayunit100@gmail.com> wrote:
>>
>> Well, OK. I guess you could have a 1TB block do an in-place sort on
>> the file, write it to a tmp directory, and then spill the records in order
>> or something.  At that point you might as well not use Hadoop.
>>
>>
>> Michael Segel  <msegel@segel.com> | (m) 312.755.9623
>>
>> Segel and Associates
>>
>>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>

--14dae9cdc48793770b04d5c9c2d4--