hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Multithreaded Reducer
Date Fri, 10 Apr 2009 20:36:12 GMT
On Fri, Apr 10, 2009 at 12:31 PM, jason hadoop <jason.hadoop@gmail.com> wrote:

> Hi Sagar!
> There is no reason for the body of your reduce method to do more than copy
> and queue the key value set into an execution pool.

Agreed. You probably want to use either a bounded queue on your execution
pool, or even a SynchronousQueue to do handoff to executor threads.
Otherwise your reducer will churn through all of its inputs at IO rate,
potentially fill up RAM, and report "100%" complete way before it's actually
complete. Something like:

BlockingQueue<Runnable> queue = new SynchronousQueue<Runnable>();
threadPool = new ThreadPoolExecutor(WORKER_THREAD_COUNT,
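
For reference, a minimal self-contained sketch of the SynchronousQueue handoff idea. Everything beyond the queue and WORKER_THREAD_COUNT is an assumption for illustration: the class name BoundedHandoff, the pool sizes, the keep-alive, and the blocking rejection handler are not taken from the snippet above. A plain ThreadPoolExecutor rejects new work when a SynchronousQueue has no idle taker, so this sketch uses a handler that blocks the submitting thread instead.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedHandoff {
    // Hypothetical worker count; tune for your job.
    static final int WORKER_THREAD_COUNT = 4;

    /** Builds a pool whose submitting thread blocks until a worker is free. */
    static ThreadPoolExecutor newBlockingPool(int workers) {
        BlockingQueue<Runnable> queue = new SynchronousQueue<Runnable>();
        // A SynchronousQueue holds no items, so execute() is rejected whenever
        // every worker is busy. This handler turns that rejection into a
        // blocking put(), which returns only once a worker takes the task.
        RejectedExecutionHandler blockCaller = new RejectedExecutionHandler() {
            public void rejectedExecution(Runnable task, ThreadPoolExecutor pool) {
                if (pool.isShutdown()) {
                    throw new RejectedExecutionException("pool is shut down");
                }
                try {
                    pool.getQueue().put(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RejectedExecutionException("interrupted while waiting", e);
                }
            }
        };
        return new ThreadPoolExecutor(workers, workers, 0L, TimeUnit.SECONDS,
                                      queue, blockCaller);
    }

    /** Submits `tasks` short jobs and returns how many ran to completion. */
    static int runDemo(int tasks) throws InterruptedException {
        ThreadPoolExecutor pool = newBlockingPool(WORKER_THREAD_COUNT);
        final AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            // Blocks here whenever all workers are busy, so the producer
            // cannot run ahead of the consumers.
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(5); // stand-in for real per-key work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed tasks: " + runDemo(20));
    }
}
```

ThreadPoolExecutor.CallerRunsPolicy is a simpler alternative if you don't mind the reduce thread doing some of the work itself.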

> The close method will need to wait until all of the items finish
> execution and potentially keep the heartbeat up with the task tracker by
> periodically reporting something. Sadly right now the reporter has to be
> grabbed from the reduce method as configure and close do not get an
> instance.

+1. You probably want to call reporter.progress() after each item is
processed by the worker threads.
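
A rough sketch of how the two pieces might fit together: capture the reporter in reduce, call progress() from the workers, and keep reporting from close while the pool drains. To keep it runnable without Hadoop on the classpath, ProgressReporter is a hypothetical stand-in for the Reporter passed to reduce, and the class name, pool size, and one-second wait are likewise assumptions. It also assumes progress() is safe to call from worker threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProgressSketch {
    /** Hypothetical stand-in for the Reporter handed to reduce(). */
    interface ProgressReporter {
        void progress();
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    // Captured in reduce(), because close() is not given a reporter.
    private volatile ProgressReporter reporter;

    /** Analogue of reduce(): remember the reporter and hand the item off. */
    void reduce(final String item, final ProgressReporter r) {
        reporter = r;
        pool.execute(new Runnable() {
            public void run() {
                process(item);
                r.progress(); // one heartbeat per finished item
            }
        });
    }

    /** Placeholder for the real per-key work. */
    void process(String item) {
    }

    /** Analogue of close(): wait for the workers, reporting while waiting. */
    void close() throws InterruptedException {
        pool.shutdown();
        while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
            if (reporter != null) {
                reporter.progress(); // keep the task from looking hung
            }
        }
    }

    /** Runs `items` fake records through the sketch; returns progress() calls. */
    static int runDemo(int items) throws InterruptedException {
        final AtomicInteger calls = new AtomicInteger();
        ProgressReporter counting = new ProgressReporter() {
            public void progress() {
                calls.incrementAndGet();
            }
        };
        ProgressSketch sketch = new ProgressSketch();
        for (int i = 0; i < items; i++) {
            sketch.reduce("item-" + i, counting);
        }
        sketch.close();
        return calls.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("progress() calls: " + runDemo(10));
    }
}
```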

> I believe the key and value objects are reused by the framework on the next
> call to reduce, so making a copy before queuing them into your thread pool
> is important.

+1 here too. You will definitely run into issues if you don't make a deep copy.
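
To illustrate the reuse hazard without pulling in Hadoop, here is a small sketch. MutableValue is a hypothetical stand-in for a reused Writable, and reusingIterator only imitates the object reuse described above. It is not the framework's actual iterator. With real types you would use whatever copy mechanism they provide (for Text, the copy constructor new Text(other)).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CopyBeforeQueue {
    /** Hypothetical mutable value, standing in for a reused Writable. */
    static final class MutableValue {
        private String contents = "";

        MutableValue() {
        }

        /** Copy constructor: the new object shares no mutable state. */
        MutableValue(MutableValue other) {
            this.contents = other.contents;
        }

        void set(String s) {
            contents = s;
        }

        String get() {
            return contents;
        }
    }

    /** Copies each value out of the iterator so later reuse cannot alter it. */
    static List<MutableValue> snapshot(Iterator<MutableValue> values) {
        List<MutableValue> copies = new ArrayList<MutableValue>();
        while (values.hasNext()) {
            copies.add(new MutableValue(values.next()));
        }
        return copies;
    }

    /** Collects the references as-is; shown only to demonstrate the bug. */
    static List<MutableValue> withoutCopy(Iterator<MutableValue> values) {
        List<MutableValue> refs = new ArrayList<MutableValue>();
        while (values.hasNext()) {
            refs.add(values.next());
        }
        return refs;
    }

    /** Iterator that hands back one shared object, overwritten on each next(). */
    static Iterator<MutableValue> reusingIterator(final String[] data) {
        final MutableValue shared = new MutableValue();
        return new Iterator<MutableValue>() {
            private int i = 0;

            public boolean hasNext() {
                return i < data.length;
            }

            public MutableValue next() {
                shared.set(data[i++]);
                return shared;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        String[] data = {"a", "b", "c"};
        System.out.println(snapshot(reusingIterator(data)).get(0).get());
        System.out.println(withoutCopy(reusingIterator(data)).get(0).get());
    }
}
```

The copied list keeps each value as it was when read, while the uncopied list ends up holding the same object three times, so every entry shows the last value written.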

