Date: Fri, 15 Feb 2013 13:07:16 -0800
Subject: Re: Sorting huge text files in Hadoop
From: Sandy Ryza
To: user@hadoop.apache.org

A map-only job does not result in the standard shuffle-sort. Map outputs
are written directly to HDFS.

-Sandy

On Fri, Feb 15, 2013 at 12:23 PM, Jay Vyas wrote:

> Maybe I'm mistaken about what is meant by map-only. Does a map-only job
> still result in the standard shuffle-sort? Or does that get cut short?
>
> hmmm, I think I see what you mean. I guess a map-only sort is possible as
> long as you use a custom partitioner and you let the shuffle/sort run to
> completion.
>
> I think the shuffle/sort, if you use a partitioner that partitions the
> sorting in order (i.e. part-0 is all lines starting with "a", part-1 is all
> starting with "b", etc.), does still run in spite of the fact that you're
> not running reducers.
>
> On Fri, Feb 15, 2013 at 3:09 PM, Michael Segel wrote:
>
>> Why do you need a 1TB block?
>>
>> On Feb 15, 2013, at 1:29 PM, Jay Vyas wrote:
>>
>> well.. ok... I guess you could have a 1TB block do an in-place sort on
>> the file, write it to a tmp directory, and then spill the records in order
>> or something. At that point you might as well not use Hadoop.
>>
>> Michael Segel | (m) 312.755.9623
>>
>> Segel and Associates
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
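
To make the distinction in this thread concrete, here is a small plain-Python simulation. It is only a sketch and does not use Hadoop's API: the function names, the sample lines, and the first-letter partitioner are all made up for illustration. It contrasts a map-only run, where each map task's output is written out as-is (in Hadoop this is what happens when the number of reduce tasks is set to zero), with a run that shuffles through an order-preserving partitioner and sorts inside each partition.

```python
from collections import defaultdict


def map_only(splits):
    """Simulate a map-only job: one output file per input split,
    records kept in the order the mapper saw them (no shuffle, no sort)."""
    return [list(split) for split in splits]


def first_letter_partition(line, num_partitions):
    """Hypothetical range partitioner: lines starting with 'a' go to
    partition 0, 'b' to partition 1, and so on, clamped to the valid range.
    Empty lines go to partition 0."""
    if not line:
        return 0
    index = ord(line[0]) - ord("a")
    return min(max(index, 0), num_partitions - 1)


def map_shuffle_reduce(splits, num_partitions):
    """Simulate a job with reducers: records are routed by the partitioner
    (shuffle) and each partition is sorted before it is written out."""
    partitions = defaultdict(list)
    for split in splits:
        for line in split:
            partitions[first_letter_partition(line, num_partitions)].append(line)
    return [sorted(partitions[p]) for p in range(num_partitions)]


# Two input splits, standing in for two blocks of a large text file.
splits = [
    ["cherry", "apple", "banana"],
    ["blueberry", "avocado", "cranberry"],
]

map_only_files = map_only(splits)
sorted_files = map_shuffle_reduce(splits, num_partitions=3)

for name, files in (("map-only", map_only_files), ("with reducers", sorted_files)):
    for i, lines in enumerate(files):
        print(f"{name} part-{i}: {lines}")
```

In the map-only case the part files are just the mapper outputs in input order. With the partitioner plus a sort per partition, reading part-0, part-1, part-2 back to back gives a globally sorted file, which is the effect Jay's "part-0 is all lines starting with a" idea is after. It needs the shuffle and the reducers to get there.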