hadoop-hdfs-user mailing list archives

From Kyle Moses <kmo...@cs.duke.edu>
Subject Re: Distributed Cache For 100MB+ Data Structure
Date Sat, 13 Oct 2012 14:46:46 GMT
Thanks for the suggestion on serializing the radix tree and your 
thoughts on the memory issue.  I'm planning to test a few different 
solutions and will post another reply if the results prove interesting.
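
For reference, the precompute-and-serialize idea quoted below could be sketched roughly as follows. This is only a minimal sketch: PrefixTrie is a hypothetical stand-in for the real radix tree (a plain binary trie over address bits), the save/load helper names are made up, and the Hadoop side is left out entirely. The assumption is that save() runs once before job submission, the resulting file is shipped through the distributed cache, and each reducer calls load() on its local copy from setup().

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.InetAddress;

public class RadixTreeCacheSketch {

    /** Hypothetical stand-in for the radix tree: a binary trie keyed on address bits. */
    static class PrefixTrie implements Serializable {
        private static final long serialVersionUID = 1L;

        private static class Node implements Serializable {
            private static final long serialVersionUID = 1L;
            Node zero;
            Node one;
            String value; // payload stored at the end of a prefix, or null
        }

        private final Node root = new Node();

        /** Stores value under the first prefixLen bits of addr. */
        void insert(byte[] addr, int prefixLen, String value) {
            Node n = root;
            for (int i = 0; i < prefixLen; i++) {
                int bit = (addr[i / 8] >> (7 - (i % 8))) & 1;
                if (bit == 0) {
                    if (n.zero == null) n.zero = new Node();
                    n = n.zero;
                } else {
                    if (n.one == null) n.one = new Node();
                    n = n.one;
                }
            }
            n.value = value;
        }

        /** Returns the value of the longest stored prefix covering addr, or null. */
        String longestMatch(byte[] addr) {
            Node n = root;
            String best = n.value;
            for (int i = 0; i < addr.length * 8 && n != null; i++) {
                int bit = (addr[i / 8] >> (7 - (i % 8))) & 1;
                n = (bit == 0) ? n.zero : n.one;
                if (n != null && n.value != null) best = n.value;
            }
            return best;
        }
    }

    /** Run once, before job submission: write the prebuilt tree to a file. */
    static void save(PrefixTrie trie, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(trie);
        }
    }

    /** Run in each reducer's setup(): read the tree back instead of rebuilding it. */
    static PrefixTrie load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (PrefixTrie) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PrefixTrie trie = new PrefixTrie();
        trie.insert(InetAddress.getByName("2001:db8::").getAddress(), 32, "documentation-prefix");

        File file = File.createTempFile("radix-tree", ".ser");
        file.deleteOnExit();
        save(trie, file);

        PrefixTrie restored = load(file);
        System.out.println(restored.longestMatch(InetAddress.getByName("2001:db8::1").getAddress()));
    }
}
```

Default Java serialization walks the node graph recursively, which is fine for a trie at most 128 levels deep but may be slow or bulky for a 100MB+ structure, so a custom compact format may turn out to be the better choice after measuring.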


On 10/11/2012 1:52 PM, Chris Nauroth wrote:
> Hello Kyle,
> Regarding the setup time of the radix tree, is it possible to 
> precompute the radix tree before job submission time, then create a 
> serialized representation (perhaps just Java object serialization), 
> and send the serialized form through distributed cache?  Then, each 
> reducer would just need to deserialize during setup() instead of 
> recomputing the full radix tree for every reducer task.  That might 
> save time.
> Regarding the memory consumption, when I've run into a situation like 
> this, I've generally solved it by caching the data in a separate 
> process and using some kind of IPC from the reducers to access it.
> memcache is one example, though that's probably not an ideal fit for
> this data structure.  I'm aware of no equivalent solution directly in 
> Hadoop and would be curious to hear from others on the topic.
> Thanks,
> --Chris
> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <kmoses@cs.duke.edu> wrote:
>     Problem Background:
>     I have a Hadoop MapReduce program that uses an IPv6 radix tree to
>     provide auxiliary input during the reduce phase of the second job
>     in its workflow, but doesn't need the data at any other point.
>     It seems pretty straightforward to use the distributed cache to
>     build this data structure inside each reducer in the setup() method.
>     This solution is functional, but ends up using a large amount of
>     memory if I have 3 or more reducers running on the same node, and
>     the setup time of the radix tree is non-trivial.
>     Additionally, the IPv6 version of the structure is quite a bit
>     larger in memory.
>     Question:
>     Is there a "good" way to share this data structure across all
>     reducers on the same node within the Hadoop framework?
>     Initial Thoughts:
>     It seems like this might be possible by altering the Task JVM
>     Reuse parameters, but from what I have read this would also affect
>     map tasks and I'm concerned about drawbacks/side-effects.
>     Thanks for your help!
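
On the JVM reuse idea in the initial thoughts above, one possible shape is a static, lazily loaded holder, sketched here in plain Java. The names Lazy, TREE, LOADS and taskSetup are hypothetical, and the string stands in for the real radix tree. This assumes JVM reuse is enabled for the job (in Hadoop 1.x the property was mapred.job.reuse.jvm.num.tasks). As far as I know, tasks in a reused JVM run one after another, so a holder like this would save repeated setup time but would not reduce memory when several reducers run at the same time in separate JVMs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PerJvmCacheSketch {

    /** Loads a value at most once per JVM, on first use (double-checked locking). */
    static final class Lazy<T> {
        private final Supplier<T> loader;
        private volatile T value;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        T get() {
            T v = value;
            if (v == null) {
                synchronized (this) {
                    v = value;
                    if (v == null) {
                        v = loader.get();
                        value = v;
                    }
                }
            }
            return v;
        }
    }

    /** Counts how often the expensive load actually ran (for demonstration only). */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Static field: survives across tasks for as long as the JVM is reused. */
    static final Lazy<String> TREE = new Lazy<>(() -> {
        LOADS.incrementAndGet();
        return "expensive radix tree"; // placeholder for deserializing the real tree
    });

    /** What each task's setup() would call instead of rebuilding the tree. */
    static String taskSetup() {
        return TREE.get();
    }

    public static void main(String[] args) {
        taskSetup(); // first task in this JVM pays the load cost
        taskSetup(); // a later task in the same JVM reuses the cached value
        System.out.println("loads: " + LOADS.get());
    }
}
```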
