Subject: Re: Performance of direct vs indirect shuffling
From: Todd Lipcon <todd@cloudera.com>
To: mapreduce-user@hadoop.apache.org
Date: Tue, 20 Dec 2011 16:53:57 -0800

The advantage of the "pull"-based shuffle is fault tolerance: if you
shuffle directly to the reducer and the reducer then dies, you have to
rerun *all* of the earlier maps in the "push" model.

The advantage of writing to disk is of course that you can have more
intermediate output than fits in RAM. In practice, for short jobs, the
output might stay entirely in buffer cache and never actually hit disk
(RHEL by default configures the writeback period to 30 seconds when
there isn't page cache pressure).

One possible optimization I hope to look into next year is to change
the map output code to push the data to the local TT, which would have
configurable in-memory buffers; only once those overflow would they
flush to disk. Compared to just using the buffer cache, this has the
advantage that it won't _ever_ write back unless it has to for
space-consumption reasons, and it is more predictable to manage. My
guess is we could squeeze some performance out of this, but not tons.

-Todd
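A rough sketch of the spill-on-overflow buffer described above might
look like the following. The class and method names are hypothetical,
not actual TaskTracker code; the point is only that output stays in an
explicit in-memory buffer and touches disk solely when that buffer
overflows.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Buffers map output in memory and spills to disk only once the
 * configured capacity is exceeded (hypothetical sketch, not TT code).
 */
public class SpillableOutputBuffer implements AutoCloseable {
    private final int capacityBytes;          // e.g. 256 MB per map task
    private final File spillFile;             // used only on overflow
    private ByteArrayOutputStream memBuffer = new ByteArrayOutputStream();
    private OutputStream diskOut;             // lazily opened on first spill

    public SpillableOutputBuffer(int capacityBytes, File spillFile) {
        this.capacityBytes = capacityBytes;
        this.spillFile = spillFile;
    }

    public void write(byte[] data, int off, int len) throws IOException {
        if (diskOut == null && memBuffer.size() + len > capacityBytes) {
            spill();  // overflow: fall back to disk from here on
        }
        if (diskOut != null) {
            diskOut.write(data, off, len);
        } else {
            memBuffer.write(data, off, len);
        }
    }

    // Flush the in-memory contents to disk and switch to disk writes.
    private void spill() throws IOException {
        diskOut = new BufferedOutputStream(new FileOutputStream(spillFile));
        memBuffer.writeTo(diskOut);
        memBuffer = null;  // release the memory
    }

    @Override
    public void close() throws IOException {
        if (diskOut != null) {
            diskOut.close();
        }
        // If nothing spilled, the in-memory contents would be served to
        // reducers directly rather than written out here.
    }
}

The difference from leaning on the page cache is that the capacity
check is explicit: nothing hits disk, or the kernel writeback path,
unless the in-memory output genuinely exceeds the configured buffer.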
On Tue, Dec 20, 2011 at 3:55 PM, Kevin Burton wrote:
> The current Hadoop implementation shuffles directly to disk, and those
> disk files are eventually requested by the target nodes, which are
> responsible for doing the reduce() on the intermediate data.
>
> However, this requires 2x more IO than strictly necessary.
>
> If the data were instead shuffled DIRECTLY to the target host, this IO
> overhead would be removed.
>
> I believe that any benefits from writing locally (compressing,
> combining) and then doing a transfer can be had by simply allocating a
> buffer (say 250-500MB per map task) and then transferring data
> directly. I don't think the savings will be 100% on par with first
> writing locally, but remember it's already 2x faster by not having to
> write to disk... so any advantages to first shuffling to the local disk
> would have to be more than 100% faster.
>
> However, writing data to the local disk first could in theory have
> some practical advantages under certain loads. I just don't think
> they're practical, and direct shuffling is superior.
>
> Anyone have any thoughts here?

--
Todd Lipcon
Software Engineer, Cloudera
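For comparison, Kevin's push-based proposal in the quoted message
amounts to something like the sketch below: a fixed per-map buffer that
streams partition data straight to the reducer's host when it fills.
The host/port wiring and class names are hypothetical, and the sketch
deliberately has no fault tolerance.

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.ByteBuffer;

/**
 * Streams a map task's output for one partition directly to the
 * reducer's host from a fixed in-memory buffer (hypothetical sketch;
 * real Hadoop shuffles pull-style via the TaskTracker).
 */
public class DirectShufflePusher {
    private final ByteBuffer buffer;    // e.g. 250-500 MB per map task
    private final String reducerHost;   // assumed known up front
    private final int reducerPort;

    public DirectShufflePusher(int bufferBytes, String reducerHost, int reducerPort) {
        this.buffer = ByteBuffer.allocate(bufferBytes);
        this.reducerHost = reducerHost;
        this.reducerPort = reducerPort;
    }

    // Stage one record; push over the network when the buffer fills.
    // Assumes individual records are smaller than the buffer itself.
    public void collect(byte[] record) throws IOException {
        if (buffer.remaining() < record.length) {
            flush();
        }
        buffer.put(record);
    }

    // Send the buffered bytes straight to the reducer, bypassing local disk.
    public void flush() throws IOException {
        try (Socket socket = new Socket(reducerHost, reducerPort);
             OutputStream out = socket.getOutputStream()) {
            out.write(buffer.array(), 0, buffer.position());
        }
        buffer.clear();
    }
}

If the reducer dies partway through the job, every map that has already
pushed output to it has to rerun, which is exactly the trade-off raised
at the top of the thread.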