hadoop-mapreduce-user mailing list archives

From Kyle Moses <kmo...@cs.duke.edu>
Subject Distributed Cache For 100MB+ Data Structure
Date Thu, 11 Oct 2012 17:12:54 GMT
Problem Background:
I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide 
auxiliary input during the reduce phase of the second job in its 
workflow, but doesn't need the data at any other point.
It seems pretty straightforward to use the distributed cache to build 
this data structure inside each reducer in the setup() method.
This solution works, but it has two costs: every reducer holds its own 
copy, so memory use gets large when 3 or more reducers run on the same 
node, and the setup time of the radix tree is non-trivial.
Additionally, the IPv6 version of the structure is quite a bit larger in 

Is there a "good" way to share this data structure across all reducers 
on the same node within the Hadoop framework?

Initial Thoughts:
It seems like this might be possible by altering the Task JVM Reuse 
parameters, but from what I have read this would also affect map tasks 
and I'm concerned about drawbacks/side-effects.
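
To make the JVM-reuse idea concrete, here is a minimal sketch of a per-JVM holder that setup() could call instead of building the tree directly. Everything in it is an assumption for illustration: the class name SharedTreeHolder, the tab-separated "prefix<TAB>value" file format, and the sorted map standing in for the real radix tree are all hypothetical, and the path is assumed to be the locally cached copy of the distributed cache file. A static field like this is only shared between tasks that execute in the same JVM. As far as I understand, Hadoop 1.x JVM reuse (the mapred.job.reuse.jvm.num.tasks job property) runs those tasks one after another, so this would mainly save the repeated setup time rather than the memory used by reducers running concurrently on a node.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical holder: builds the lookup structure at most once per JVM.
final class SharedTreeHolder {
    // Stand-in for the radix tree; a real implementation would go here.
    private static volatile SortedMap<String, String> tree;
    private static int loadCount = 0;

    private SharedTreeHolder() {}

    // Call from the reducer's setup(); later tasks in the same JVM reuse the result.
    static SortedMap<String, String> get(Path localCacheFile) throws IOException {
        SortedMap<String, String> t = tree;
        if (t == null) {
            synchronized (SharedTreeHolder.class) {
                t = tree;
                if (t == null) {
                    t = Collections.unmodifiableSortedMap(load(localCacheFile));
                    loadCount++;
                    tree = t;
                }
            }
        }
        return t;
    }

    // Number of times the file was actually parsed in this JVM.
    static synchronized int loadCount() {
        return loadCount;
    }

    // Assumed file format: one "prefix<TAB>value" pair per line.
    private static TreeMap<String, String> load(Path file) throws IOException {
        TreeMap<String, String> result = new TreeMap<String, String>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    result.put(parts[0], parts[1]);
                }
            }
        }
        return result;
    }
}
```

The returned map is read-only, so tasks that share it cannot corrupt it for later tasks. Because the holder caches the first file it is given, it assumes every task in the JVM wants the same cache file.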

Thanks for your help!
