hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jianmin Woo <jianmin_...@yahoo.com>
Subject Re: question about when shuffle/sort start working
Date Thu, 04 Jun 2009 01:35:46 GMT
Thanks a lot for your suggestions on the interplay between job and the driver, Chuck.

Yes, the job may hold some , say,  training data, which is needed in each round of the job.
I will check the link you provided. Actually, I am thinking some really light-weight map/reduce
jobs. For example, each job consist just summing up an integer or array of integers sent from
the map tasks. There maybe huge amount of this kind of jobs in the algorithm, so the efficiency
is critical. 


From: Chuck Lam <chuck.lam@gmail.com>
To: core-user@hadoop.apache.org
Sent: Tuesday, June 2, 2009 4:22:46 PM
Subject: Re: question about when shuffle/sort start working

I'm pretty sure JVM reuse is only for tasks within the same job. You can't
persist data from one job to the next this way. I don't think it's even
possible to guarantee the same set of nodes to be running one job as the
next job.

Let me see if I understand your situation correctly. You're implementing
some iterative algorithm where each iteration can be neatly implemented as a
MR job. It iterates until some convergence criterion is met (or some max
number of iterations, to avoid infinite loop). The challenge is that it's
not clear how a convergence criterion can be communicated globally to the
driver of this job chain, for the driver to decide when to stop.

I haven't tried this before, but if the convergence criterion is just a
single value, one hack your job can do is output the convergence criterion
as a Counter thru the Reporter object. At the end of each job (iteration),
the counter is checked to see if the convergence criterion meets the
stopping requirements. It launches the next job only if the stopping
requirement is not met.

Counters are automatically summed across all tasks. This should be fine if
your convergence criterion is something like a sum of residuals.

Counters are integers and most convergence criteria are floating point, so
you'll have to scale your numbers and round them to integers to approximate
things. (Like I said, it's a bit of a hack.)

If you have to persist a lot of data from one job to the next, Jimmy Lin et
al has tried adding a memcached system to Hadoop to do that.
It's a pretty heavy set up tho, but may be useful for some applications.

On Mon, Jun 1, 2009 at 11:31 PM, Jianmin Woo <jianmin_woo@yahoo.com> wrote:

> Thanks for your information, Todd.
> It's cool to address the job overhead in JVM reuse way. You mentioned that
> the code(and variables?) are kept in JIT cache, so can we refer the
> variables in the last iteration in some way? I am really caurious how this
> works in coding level. Could you please help to point out some
> material/resources I can start to get to know the usage of this feature? Is
> this implemented in the current 0.20 version? I am pretty fresh to haoop.
> :-)
> Thanks,
> Jianmin
> ________________________________
> From: Todd Lipcon <todd@cloudera.com>
> To: core-user@hadoop.apache.org
> Sent: Tuesday, June 2, 2009 1:36:17 PM
> Subject: Re: question about when shuffle/sort start working
> On Mon, Jun 1, 2009 at 7:49 PM, Jianmin Woo <jianmin_woo@yahoo.com> wrote:
> >
> > In this way, if some job fails, we can re-run the whole job sequence
> > starting from the latest checkpoint instead of the beginning. It will be
> > nice if all the sequence (it`s actually a loop which happens in many of
> the
> > machine learning algorithms, each loop contains a Map and Reduce step.
> > PageRank calculation is one of the examples) can be done in a single job.
> > Because if an algorithm takes 1000 steps to converge, we have to start
> 1000
> > jobs in the job sequence way, which is costly since of the start/stop of
> > jobs.
> Yep - as you mentioned this is very common in many algorithms on MapReduce,
> especially in machine learning and graph processing. The only way to do
> this
> is, like you said, run 1000 jobs.
> Job overhead is one thing that has been improved significantly in recent
> versions. JVM reuse is one such new-ish feature that helps here, since the
> overhead of spawning tasktracker children is eliminated, and your code
> stays
> in the JIT cache between iterations.
> -Todd

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message