hadoop-common-user mailing list archives

From: "Spiros Papadimitriou" <spapa...@gmail.com>
Subject: Re: Reusing jobs
Date: Fri, 18 Apr 2008 04:47:10 GMT
Hi --

I'm not really sure that JVM startup is the main overhead -- you could take
a look at the log files of the individual TIPs and compare the timestamp of
the first log message to the time the jobtracker reports that the TIP was
started. In my experience, that is well under a second (once the cluster
has warmed up), but please do correct me if I'm wrong -- I'd really be
interested to know what others observe.
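
If you want to script half of that comparison, something like the sketch
below prints the epoch millis of the first line of a task log, which you
can then diff against the start time the jobtracker UI reports for the TIP.
It assumes the task logs sit under logs/userlogs/<task_id>/ and use log4j's
default "yyyy-MM-dd HH:mm:ss,SSS" timestamp format -- adjust the path and
pattern for your setup:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.text.SimpleDateFormat;
  import java.util.Date;

  // Prints the epoch millis of the first log line in a task log file,
  // given the path to that file as its only argument.
  public class FirstLogTimestamp {
    public static void main(String[] args) throws Exception {
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      String line = in.readLine();  // e.g. "2008-04-18 04:47:10,123 INFO ..."
      in.close();
      SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
      Date first = fmt.parse(line.substring(0, 23));  // leading timestamp
      System.out.println(first.getTime());
    }
  }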

BTW, some very rough benchmarks on something similar:
  http://www.cs.cmu.edu/~spapadim/hadoop/timeline.html

The last plot shows executing the job locally (with a chunk size of 128MB)
vs a hand-coded C++ program -- both do a simple regex match and then
construct a histogram of counts of the matched strings.  The overhead is
impressively small -- I'm assuming that local execution of a Hadoop job will
still fire up a separate JVM for each map chunk (I didn't double-check
this).
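
For reference, the Hadoop side of that comparison is essentially a
grep-and-count job. A stripped-down sketch of the idea (not the exact code
behind those numbers, and the regex here is just a placeholder):

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class RegexHistogram {
    // Emits (matched string, 1) for every regex match in each input line.
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final Pattern PATTERN = Pattern.compile("[a-z]+");  // placeholder
      private static final IntWritable ONE = new IntWritable(1);
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        Matcher m = PATTERN.matcher(value.toString());
        while (m.find()) {
          out.collect(new Text(m.group()), ONE);
        }
      }
    }

    // Sums the counts per matched string to build the histogram.
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
      }
    }
  }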

Cheers,
Spiros

On Thu, Apr 17, 2008 at 10:43 PM, Karl Wettin <karl.wettin@gmail.com> wrote:

> Ted Dunning wrote:
>
> > Hadoop has enormous startup costs that are relatively inherent in the
> > current design.
> >
> > Most notably, mappers and reducers are executed in a standalone JVM
> > (ostensibly for safety reasons).
> >
>
> Is it possible to hack in support for reusing JVMs? Keep them alive until
> they time out and have them execute jobs by opening a socket and saying
> hello? What classes should I start looking in? Could be a fun exercise.
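
If I remember right, the child JVM gets launched from
org.apache.hadoop.mapred.TaskRunner, so that and TaskTracker are probably
where I'd start reading. The shape of what you describe -- a worker JVM
that stays alive, takes tasks over a socket, and exits after an idle
timeout -- would be roughly the sketch below. Everything in it is made up
for illustration; none of it is existing Hadoop API:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.PrintWriter;
  import java.net.ServerSocket;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  // A long-lived worker JVM: accepts one task class name per connection
  // and exits after 60 seconds of inactivity. Dispatch via reflection is
  // just a stand-in for however tasks would really be sent over the wire.
  public class ReusableWorker {
    public static void main(String[] args) throws Exception {
      ServerSocket server = new ServerSocket(Integer.parseInt(args[0]));
      server.setSoTimeout(60 * 1000);  // idle timeout before the JVM gives up
      while (true) {
        Socket s;
        try {
          s = server.accept();
        } catch (SocketTimeoutException e) {
          break;  // no work arrived in time: let the JVM exit
        }
        BufferedReader in =
            new BufferedReader(new InputStreamReader(s.getInputStream()));
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        String taskClass = in.readLine();  // the "hello": which task to run
        ((Runnable) Class.forName(taskClass).newInstance()).run();
        out.println("done");
        s.close();
      }
      server.close();
    }
  }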
>
>          karl
>
> >
> > On 4/17/08 6:00 PM, "Karl Wettin" <karl.wettin@gmail.com> wrote:
> >
> > > Is it possible to execute a job more than once?
> > >
> > > I use map reduce when adding a new instance to a hierarchical cluster
> > > tree. It finds the least distant node and inserts the new instance as
> > > a sibling to that node.
> > >
> > > As far as I know, it is in the very nature of this algorithm that one
> > > inserts one instance at a time; this is how the second dimension is
> > > created that makes it better than a vector cluster. It would be
> > > possible to map all permutations of instances and skip the reduction,
> > > but that would result in many more calculations than iteratively
> > > training the tree, as the latter only requires testing against the
> > > instances already inserted into the tree.
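
Just to check I'm following: the insertion step is something like the
sketch below, with the leaf scan being the part you farm out to Hadoop?
(All the names here are mine, purely for illustration.)

  import java.util.LinkedList;

  // Hierarchical cluster tree: leaves hold instances, internal nodes are
  // binary splits.
  class ClusterTree {
    static class Node {
      double[] instance;  // null for internal nodes
      Node left, right, parent;
      Node(double[] inst) { instance = inst; }
    }

    Node root;

    static double distance(double[] a, double[] b) {
      double sum = 0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return Math.sqrt(sum);
    }

    // Scan every leaf for the one nearest the new instance. This linear
    // scan over all inserted instances is what each job computes.
    Node leastDistantLeaf(double[] inst) {
      Node best = null;
      double bestDist = Double.MAX_VALUE;
      LinkedList<Node> stack = new LinkedList<Node>();
      stack.addFirst(root);
      while (!stack.isEmpty()) {
        Node n = stack.removeFirst();
        if (n.instance != null) {
          double d = distance(inst, n.instance);
          if (d < bestDist) { bestDist = d; best = n; }
        } else {
          stack.addFirst(n.left);
          stack.addFirst(n.right);
        }
      }
      return best;
    }

    // Insert the new instance as a sibling of its least distant leaf.
    void insert(double[] inst) {
      Node leaf = new Node(inst);
      if (root == null) { root = leaf; return; }
      Node sibling = leastDistantLeaf(inst);
      Node parent = new Node(null);
      parent.parent = sibling.parent;
      parent.left = sibling;
      parent.right = leaf;
      sibling.parent = parent;
      leaf.parent = parent;
      if (parent.parent == null) {
        root = parent;
      } else if (parent.parent.left == sibling) {
        parent.parent.left = parent;
      } else {
        parent.parent.right = parent;
      }
    }
  }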
> > >
> > > Iteratively training this tree using Hadoop means executing one job
> > > per instance, which measures the distance to all instances in a file
> > > that I also append the new instance to once it is inserted in the
> > > tree.
> > >
> > > All of the above is very inefficient, especially with a young tree
> > > that could be trained in nanoseconds locally. So I do that until it
> > > takes 20 seconds to insert an instance.
> > >
> > > But really, this is all Hadoop framework overhead. I'm not quite sure
> > > of all it does when I execute a job, but it seems like quite a lot.
> > > And all I'm doing is executing a couple of identical jobs over and
> > > over again using new data.
> > >
> > > It would be very nice if it just took a few milliseconds to do that.
> > >
> > >
> > >       karl
> > >
> >
> >
>
