Subject: Re: Performance of direct vs indirect shuffling
From: Todd Lipcon <todd@cloudera.com>
To: mapreduce-user@hadoop.apache.org
Date: Tue, 20 Dec 2011 16:53:57 -0800

The advantage of the "pull"-based shuffle is fault tolerance: if you
shuffle directly to the reducer and the reducer then dies, you have to
rerun *all* of the earlier maps in the "push" model.

The advantage of writing to disk is of course that you can have more
intermediate output than fits in RAM. In practice, for short jobs, the
output might stay entirely in buffer cache and never actually hit disk
(RHEL by default configures the writeback period to 30 seconds when
there isn't page cache pressure).

One possible optimization I hope to look into next year is to change
the map output code to push the data to the local TT, which would have
configurable in-memory buffers; only once those overflow would they
flush to disk. Compared to just using the buffer cache, this has the
advantage that it won't _ever_ write back unless it has to for
space-consumption reasons, and it is more predictable to manage. My
guess is we could squeeze some performance out of this, but not tons.

-Todd
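A rough sketch of the spill-on-overflow buffer described above might
look like the following. The class and method names are hypothetical,
not actual TaskTracker code; the point is only that output stays in an
explicit in-memory buffer and touches disk solely when that buffer
overflows.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Buffers map output in memory and spills to disk only once the
 * configured capacity is exceeded (hypothetical sketch, not TT code).
 */
public class SpillableOutputBuffer implements AutoCloseable {
    private final int capacityBytes;          // e.g. 256 MB per map task
    private final File spillFile;             // used only on overflow
    private ByteArrayOutputStream memBuffer = new ByteArrayOutputStream();
    private OutputStream diskOut;             // lazily opened on first spill

    public SpillableOutputBuffer(int capacityBytes, File spillFile) {
        this.capacityBytes = capacityBytes;
        this.spillFile = spillFile;
    }

    public void write(byte[] data, int off, int len) throws IOException {
        if (diskOut == null && memBuffer.size() + len > capacityBytes) {
            spill();  // overflow: fall back to disk from here on
        }
        if (diskOut != null) {
            diskOut.write(data, off, len);
        } else {
            memBuffer.write(data, off, len);
        }
    }

    // Flush the in-memory contents to disk and switch to disk writes.
    private void spill() throws IOException {
        diskOut = new BufferedOutputStream(new FileOutputStream(spillFile));
        memBuffer.writeTo(diskOut);
        memBuffer = null;  // release the memory
    }

    @Override
    public void close() throws IOException {
        if (diskOut != null) {
            diskOut.close();
        }
        // If nothing spilled, the in-memory contents would be served to
        // reducers directly rather than written out here.
    }
}

The difference from leaning on the page cache is that the capacity
check is explicit: nothing hits disk, or the kernel writeback path,
unless the in-memory output genuinely exceeds the configured buffer.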
On Tue, Dec 20, 2011 at 3:55 PM, Kevin Burton wrote:
> The current Hadoop implementation shuffles directly to disk, and those
> disk files are eventually requested by the target nodes, which are
> responsible for doing the reduce() on the intermediate data.
>
> However, this requires 2x more IO than strictly necessary.
>
> If the data were instead shuffled DIRECTLY to the target host, this IO
> overhead would be removed.
>
> I believe that any benefits from writing locally (compressing,
> combining) and then doing a transfer can be had by simply allocating a
> buffer (say 250-500MB per map task) and then transferring data
> directly. I don't think the savings will be 100% on par with first
> writing locally, but remember it's already 2x faster by not having to
> write to disk... so any advantages to first shuffling to the local disk
> would have to be more than 100% faster.
>
> However, writing data to the local disk first could in theory have
> some practical advantages under certain loads. I just don't think
> they're practical, and direct shuffling is superior.
>
> Anyone have any thoughts here?

--
Todd Lipcon
Software Engineer, Cloudera
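For comparison, Kevin's push-based proposal in the quoted message
amounts to something like the sketch below: a fixed per-map buffer that
streams partition data straight to the reducer's host when it fills.
The host/port wiring and class names are hypothetical, and the sketch
deliberately has no fault tolerance.

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.ByteBuffer;

/**
 * Streams a map task's output for one partition directly to the
 * reducer's host from a fixed in-memory buffer (hypothetical sketch;
 * real Hadoop shuffles pull-style via the TaskTracker).
 */
public class DirectShufflePusher {
    private final ByteBuffer buffer;    // e.g. 250-500 MB per map task
    private final String reducerHost;   // assumed known up front
    private final int reducerPort;

    public DirectShufflePusher(int bufferBytes, String reducerHost, int reducerPort) {
        this.buffer = ByteBuffer.allocate(bufferBytes);
        this.reducerHost = reducerHost;
        this.reducerPort = reducerPort;
    }

    // Stage one record; push over the network when the buffer fills.
    // Assumes individual records are smaller than the buffer itself.
    public void collect(byte[] record) throws IOException {
        if (buffer.remaining() < record.length) {
            flush();
        }
        buffer.put(record);
    }

    // Send the buffered bytes straight to the reducer, bypassing local disk.
    public void flush() throws IOException {
        try (Socket socket = new Socket(reducerHost, reducerPort);
             OutputStream out = socket.getOutputStream()) {
            out.write(buffer.array(), 0, buffer.position());
        }
        buffer.clear();
    }
}

If the reducer dies partway through the job, every map that has already
pushed output to it has to rerun, which is exactly the trade-off raised
at the top of the thread.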