crunch-user mailing list archives

From "Brush,Ryan" <RBR...@CERNER.COM>
Subject Re: Reliably Parallelizing CPU-Intensive DoFns
Date Fri, 26 Sep 2014 13:59:09 GMT
There be dragons, but in years past I solved a similar problem with the MultithreadedMapper
[1], and it would be possible to do something similar in a DoFn implementation. Basically,
you can read multiple inputs and farm them off to threads, then synchronize and flush
after N items are processed, and do a final flush to the emitter in the cleanup(…) method.
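
A rough sketch of that pattern in plain Java, to make the shape concrete. This is not Crunch code: BatchingParallelFn, its constructor arguments, and the Consumer used as the emitter are hypothetical stand-ins for a DoFn, its configuration, and Crunch's Emitter. Results are emitted only from the calling thread, on the assumption that the real emitter is not safe to call from worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical stand-in for a DoFn that farms inputs off to a thread pool. */
class BatchingParallelFn<S, T> {
    private final Function<S, T> expensive;  // the CPU-heavy computation
    private final int batchSize;             // N: flush after this many inputs
    private final ExecutorService pool;
    private final List<Future<T>> pending = new ArrayList<>();

    BatchingParallelFn(Function<S, T> expensive, int threads, int batchSize) {
        this.expensive = expensive;
        this.batchSize = batchSize;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Plays the role of process(...): submit the work, flush once N are queued. */
    void process(S input, Consumer<T> emitter) {
        // A real DoFn would need to detach (copy) the input here first,
        // since the framework may reuse the object for the next record.
        Callable<T> task = () -> expensive.apply(input);
        pending.add(pool.submit(task));
        if (pending.size() >= batchSize) {
            flush(emitter);
        }
    }

    /** Plays the role of cleanup(...): final flush, then stop the pool. */
    void cleanup(Consumer<T> emitter) {
        try {
            flush(emitter);
        } finally {
            pool.shutdown();
        }
    }

    /** Waits for every pending task and emits results in submission order. */
    private void flush(Consumer<T> emitter) {
        try {
            for (Future<T> f : pending) {
                emitter.accept(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pending.clear();
        }
    }
}
```

Blocking in flush keeps the number of in-flight items bounded by N, at the cost of idling the pool while the slowest item in each batch finishes.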

There are lots of pitfalls to managing your own threads, of course. You’d need to detach
incoming values passed to the DoFn so they don’t get clobbered by other threads, it could
fight against Hadoop’s resource management (since Hadoop wants to manage how many threads
are running), and writing multi-threaded code is pretty terrible in general. But it’s an
option at least.

[1]
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html

On Sep 25, 2014, at 11:03 PM, Allan Shoup <allan.shoup@gmail.com> wrote:

I failed to mention that I don't have an opportunity to read the source - my input is
a PTable of Avro keys and values.

On Thu, Sep 25, 2014 at 8:48 PM, Josh Wills <josh.wills@gmail.com> wrote:
Could you use NLineSource to control how many shards the small input is split up into?

J

On Thu, Sep 25, 2014 at 6:10 PM, Allan Shoup <allan.shoup@gmail.com> wrote:
I have a very CPU-intensive DoFn which is running over a relatively small input. Running on a
Hadoop cluster, the job that it runs in sometimes executes the function in map tasks and
sometimes in reduce tasks. What's the best way to reliably increase parallelization?

One option may be to force a reduce step and control the number of reducers. Are there any
better options?


