That's a great way of putting it: "increasing capacity shifts the
equilibrium". If you work some examples you'll find that it doesn't take
many iterations for a workflow to converge to its stable runtime. There is
some minimum capacity you need for there to be an equilibrium though, or
else the runtime doesn't stabilize.
Regarding the sorting, I believe the complexity in Hadoop is more like n / R
* log(n/R), where R is in the number of reducers. And in practice, I've
found sorting to not take very long. But you're right  fundamentally this
model is an approximation.
On Tue, Jan 5, 2010 at 5:29 PM, Yuri Pradkin <yuri@isi.edu> wrote:
> On Sunday 03 January 2010 11:30:29 Nathan Marz <nathan.marz@gmail.com>
> wrote:
> > I did some analysis on the performance of Hadoopbased workflows. Some of
> > the results are counterintuitive so I thought the community at large
> would
> > be interested:
> >
> > http://nathanmarz.com/blog/hadoopmathematics/
> >
> > Would love to hear any feedback or comments you have.
> >
>
> Just thinking out loud: runtime generally is not equal to the "hours of
> data". In your processing model it's your next iteration that will have
> this
> amount of data. By equating them, you're assuming an equilibrium case.
> But
> if you're in an equilibrium, what would increasing the capacity mean?  I
> think it'd mean that you process your sofar accumulated data faster, so
> your
> next iteration will have less data to process and so on until you get to a
> new
> equilibrium (but how fast will you get there?). Lesson learned: increasing
> capacity shifts equilibrium. It's kind of like a reaction time of your
> system... Sometimes, I imagine, as long as you're keeping up with the
> arrival
> rate you don't care if it takes a week's full or a day's. In fact there
> may
> be some constraints on the minimum size of the input.
>
> Another comment: you're assuming that processing time is linear in the
> input
> size. Of course it depends on the processing you're doing, but even if
> YOUR
> processing is linear, Hadoop needs to sort keys, and that is at best
> O(n*log(n)), so there is inherent nonlinearity present.
>
> Similarly about the number of nodes: doubling your nodes will not double
> your
> service rate for a variety of reasons.
>
> Yuri
>

Nathan Marz
Twitter: @nathanmarz
http://nathanmarz.com
