hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Darren Govoni <dar...@ontrenet.com>
Subject Re: Big-O Notation for Hadoop
Date Mon, 01 Mar 2010 23:37:25 GMT
Its a Turing-class problem and thus non-deterministic by nature - a

But given the uniform aspect of map/reduce an estimate could continually
be approximated - as the data is processed - noting that,  the farther
from completion it is, the less accurate that calculation would be. And
of course, once completed, the estimate is 100% accurate.

In order to do it before, you would need an algorithm that can examine
your map/reduce code and predict the running cost. 
Without data on prior runs, its not -mathematically- possible. 

As a function of cycle complexity over time (which is what big O is),
map/reduce will scale somewhat linearly (maybe even logn) with regards
to data - is my hunch. There's probably a quotient in there for the
bookkeeping no one has data on yet though. But its a good inquiry.

On Mon, 2010-03-01 at 18:25 -0500, Edward Capriolo wrote:

> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <darren@ontrenet.com> wrote:
> > Theoretically. O(n)
> >
> > All other variables being equal across all nodes
> > should...mmmmm.....reduce to n.
> >
> > That part that really can't be measured is the cost of Hadoop's
> > bookkeeping chores as the data set grows since some things in Hadoop
> > involve synchronous/serial behavior.
> >
> > On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:
> >
> >> A previous post to core-user mentioned some formula to determine job
> >> time. I was wondering if anyone out there is trying to tackle
> >> designing a formula that can calculate the job run time of a
> >> map/reduce program. Obviously there are many variables here including
> >> but not limited to Disk Speed ,Network Speed, Processor Speed, input
> >> data, many constants , data-skew, map complexity, reduce complexity, #
> >> of nodes......
> >>
> >> As an intellectual challenge has anyone starting trying to write a
> >> formula that can take into account all these factors and try to
> >> actually predict a job time in minutes/hours?
> >
> >
> >
> Understood, BIG-0 notation is really not what I am looking for.
> Given all variables are the same, a hadoop job on a finite set of data
> should run for a finite time. There are parts of the process that run
> linear and parts that run in parallel, but there must be a way to
> express how long a job actually takes (although admittedly it is very
> involved to figure out)

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message