hadoop-common-user mailing list archives

From Nathan Marz <nathan.m...@gmail.com>
Subject Re: Mathematics behind Hadoop-based systems
Date Wed, 06 Jan 2010 06:45:01 GMT
That's a great way of putting it: "increasing capacity shifts the
equilibrium". If you work through some examples, you'll find that it
doesn't take many iterations for a workflow to converge to its stable
runtime. There is a minimum capacity you need for an equilibrium to exist,
though; below that, the runtime never stabilizes.
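The convergence behavior described above can be sketched with a simple recurrence (my own formulation for illustration, not taken from the blog post): each iteration must process the data that arrived during the previous iteration, plus a fixed per-run overhead. The `arrival_rate`, `capacity`, and `overhead` parameters here are assumptions for the sketch.

```python
def iterate_runtimes(arrival_rate, capacity, overhead=0.5, t0=10.0, steps=20):
    """Simulate successive workflow runtimes.

    arrival_rate and capacity are in 'hours of data per hour of wall time';
    overhead is a fixed per-iteration cost in hours (assumed, for illustration).
    """
    t = t0
    runtimes = [t]
    for _ in range(steps):
        # Next run must process everything that arrived during this run.
        t = overhead + (arrival_rate * t) / capacity
        runtimes.append(t)
    return runtimes

# With capacity above the arrival rate, the runtime converges quickly to the
# fixed point overhead / (1 - arrival_rate / capacity):
stable = iterate_runtimes(arrival_rate=1.0, capacity=2.0)

# With capacity at or below the arrival rate, each iteration inherits at
# least as much data as the last, and the runtime never stabilizes:
unstable = iterate_runtimes(arrival_rate=1.0, capacity=0.9)
```

Running a few cases like this shows both claims at once: convergence takes only a handful of iterations when capacity exceeds the arrival rate, and no equilibrium exists otherwise.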

Regarding the sorting, I believe the complexity in Hadoop is more like
(n/R) * log(n/R) per reducer, where R is the number of reducers. And in
practice, I've found that sorting doesn't take very long. But you're right
- fundamentally this model is an approximation.
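A quick back-of-the-envelope comparison (my own illustration; the function names and comparison-count model are assumptions) shows why partitioning across R reducers softens the log factor: sorting R partitions of n/R keys costs roughly n * log(n/R) total comparisons, versus n * log(n) for one global sort.

```python
import math

def single_sort_cost(n):
    """Approximate comparison count for one global sort of n keys."""
    return n * math.log2(n)

def partitioned_sort_cost(n, r):
    """Approximate total comparisons for r independent sorts of n/r keys.

    Equals n * log2(n/r): the same n keys, but a smaller log factor.
    """
    per_reducer = n / r
    return r * per_reducer * math.log2(per_reducer)

n = 10**9
for r in (1, 100, 1000):
    ratio = partitioned_sort_cost(n, r) / single_sort_cost(n)
    print(f"R={r}: partitioned cost is {ratio:.2f}x the global sort")
```

The ratio works out to 1 - log2(R)/log2(n), so for large n the per-reducer sort is close to linear in practice, which matches the observation that sorting doesn't dominate the runtime.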

On Tue, Jan 5, 2010 at 5:29 PM, Yuri Pradkin <yuri@isi.edu> wrote:

> On Sunday 03 January 2010 11:30:29 Nathan Marz <nathan.marz@gmail.com>
> wrote:
> > I did some analysis on the performance of Hadoop-based workflows. Some of
> > the results are counter-intuitive so I thought the community at large
> would
> > be interested:
> >
> > http://nathanmarz.com/blog/hadoop-mathematics/
> >
> > Would love to hear any feedback or comments you have.
> >
> Just thinking out loud: runtime generally is not equal to the "hours of
> data". In your processing model, it's your next iteration that will have
> this amount of data. By equating them, you're assuming an equilibrium
> case. But if you're in an equilibrium, what would increasing the capacity
> mean? I think it would mean that you process your so-far accumulated data
> faster, so your next iteration will have less data to process, and so on
> until you reach a new equilibrium (but how fast will you get there?).
> Lesson learned: increasing capacity shifts the equilibrium. It's kind of
> like the reaction time of your system... Sometimes, I imagine, as long as
> you're keeping up with the arrival rate, you don't care whether a run
> covers a week's worth of data or a day's. In fact, there may be some
> constraints on the minimum size of the input.
> Another comment: you're assuming that processing time is linear in the
> input size. Of course it depends on the processing you're doing, but even
> if processing is linear, Hadoop needs to sort keys, and that is at best
> O(n*log(n)), so there is inherent non-linearity present.
> Similarly for the number of nodes: doubling your nodes will not double
> your service rate, for a variety of reasons.
>  -Yuri

Nathan Marz
Twitter: @nathanmarz
