hadoop-common-user mailing list archives

From Jimmy Lin <jimmy...@umd.edu>
Subject Re: Coordination between Mapper tasks
Date Sun, 22 Mar 2009 00:37:26 GMT
Hi Stuart,

You might want to look at a memcached solution some students and I 
worked out for exactly this problem.  It's written up in:

Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. 
Low-Latency, High-Throughput Access to Static Global Resources within 
the Hadoop Framework. Technical Report HCIL-2009-01, University of 
Maryland, College Park, January 2009.

Available at:



Stuart White wrote:
> Thanks to everyone for your feedback.  I'm unfamiliar with many of the
> technologies you've mentioned, so it may take me some time to digest
> all your responses.  The first thing I'm going to look at is Ted's
> suggestion of a pure map-reduce solution by pre-joining my data with
> my lookup values.
> On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>> On Thu, Mar 19, 2009 at 6:42 PM, Stuart White <stuart.white1@gmail.com> wrote:
>>> My process requires a large dictionary of terms (~ 2GB when loaded
>>> into RAM).  The terms are looked-up very frequently, so I want the
>>> terms memory-resident.
>>> So, the problem is, I want 3 processes (to utilize CPU), but each
>>> process requires ~2GB, but my nodes don't have enough memory to each
>>> have their own copy of the 2GB of data.  So, I need to somehow share
>>> the 2GB between the processes.
>> I would recommend using the multi-threaded map runner. Have 1 map/node and
>> just use 3 worker threads that all consume the input. The only disadvantage
>> is that it works best for cpu-heavy loads (or maps that are doing crawling,
>> etc.), since you only have one record reader for all three of the map
>> threads.
>> In the longer term, it might make sense to enable parallel JVM reuse in
>> addition to serial JVM reuse.
>> -- Owen
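
For readers landing on this thread later: the heart of Owen's suggestion is that
with one multi-threaded map task per node, all map threads live in a single JVM
and can share one in-memory copy of the dictionary. A minimal sketch of that
sharing pattern is below. The class and method names (SharedDictionary,
SharedDictDemo) are illustrative, not part of any Hadoop API; in an actual job
you would enable the multi-threaded runner with
conf.setMapRunnerClass(MultithreadedMapRunner.class) and set
mapred.map.multithreadedrunner.threads to 3, then call something like
SharedDictionary.get() from your map() method.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for a large term dictionary, loaded once per JVM and
// then read concurrently by all map threads. With the multi-threaded map
// runner there is one task JVM per node, so only one ~2GB copy exists.
class SharedDictionary {
    private static volatile Map<String, Integer> terms;

    // Double-checked locking: the first thread to call get() loads the
    // dictionary; later calls (from any thread) see the same instance.
    static Map<String, Integer> get() {
        if (terms == null) {
            synchronized (SharedDictionary.class) {
                if (terms == null) {
                    Map<String, Integer> t = new HashMap<>();
                    // In a real job this would read the 2GB term file
                    // (e.g. shipped via the DistributedCache). Toy data here:
                    t.put("hadoop", 1);
                    t.put("mapreduce", 2);
                    terms = Collections.unmodifiableMap(t);
                }
            }
        }
        return terms;
    }
}

public class SharedDictDemo {
    public static void main(String[] args) throws InterruptedException {
        // Simulate three map threads doing read-only lookups against the
        // single shared dictionary instance.
        Runnable worker = () -> {
            Integer id = SharedDictionary.get().get("hadoop");
            System.out.println(Thread.currentThread().getName() + " -> " + id);
        };
        Thread[] threads = new Thread[3];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker, "map-thread-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
```

Since the map is built once and then frozen with unmodifiableMap, the threads
only ever read it, so no further locking is needed on the lookup path; that is
what keeps the per-lookup cost low compared with an out-of-process store.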
