Its a Turingclass problem and thus nondeterministic by nature  a
priori.
But given the uniform aspect of map/reduce an estimate could continually
be approximated  as the data is processed  noting that, the farther
from completion it is, the less accurate that calculation would be. And
of course, once completed, the estimate is 100% accurate.
In order to do it before, you would need an algorithm that can examine
your map/reduce code and predict the running cost.
Without data on prior runs, its not mathematically possible.
As a function of cycle complexity over time (which is what big O is),
map/reduce will scale somewhat linearly (maybe even logn) with regards
to data  is my hunch. There's probably a quotient in there for the
bookkeeping no one has data on yet though. But its a good inquiry.
On Mon, 20100301 at 18:25 0500, Edward Capriolo wrote:
> On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni <darren@ontrenet.com> wrote:
> > Theoretically. O(n)
> >
> > All other variables being equal across all nodes
> > should...mmmmm.....reduce to n.
> >
> > That part that really can't be measured is the cost of Hadoop's
> > bookkeeping chores as the data set grows since some things in Hadoop
> > involve synchronous/serial behavior.
> >
> > On Mon, 20100301 at 12:27 0500, Edward Capriolo wrote:
> >
> >> A previous post to coreuser mentioned some formula to determine job
> >> time. I was wondering if anyone out there is trying to tackle
> >> designing a formula that can calculate the job run time of a
> >> map/reduce program. Obviously there are many variables here including
> >> but not limited to Disk Speed ,Network Speed, Processor Speed, input
> >> data, many constants , dataskew, map complexity, reduce complexity, #
> >> of nodes......
> >>
> >> As an intellectual challenge has anyone starting trying to write a
> >> formula that can take into account all these factors and try to
> >> actually predict a job time in minutes/hours?
> >
> >
> >
>
> Understood, BIG0 notation is really not what I am looking for.
>
> Given all variables are the same, a hadoop job on a finite set of data
> should run for a finite time. There are parts of the process that run
> linear and parts that run in parallel, but there must be a way to
> express how long a job actually takes (although admittedly it is very
> involved to figure out)
