hadoop-mapreduce-user mailing list archives

From Danny Leshem <dles...@gmail.com>
Subject Re: Memory intensive jobs and JVM reuse
Date Thu, 29 Apr 2010 15:08:05 GMT
David,

DistributedCache distributes files across the cluster - it is not a shared
memory cache.
My problem is not distributing the HashMap across machines, but the fact
that it is replicated in memory for each task (or each job, for that
matter).
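(For reference, the JVM-reuse setting discussed below is `mapred.job.reuse.jvm.num.tasks`; a value of -1 reuses a task JVM for an unlimited number of tasks. A minimal config sketch - note that reuse only applies to tasks of the *same* job, so it does not help with the cross-job replication described above:)

```xml
<!-- mapred-site.xml; can also be set per job via
     JobConf.setNumTasksToExecutePerJvm(-1) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1 = no limit on the number of tasks (of one job) run in a JVM -->
  <value>-1</value>
</property>
```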

On Thu, Apr 29, 2010 at 4:57 PM, David Rosenstrauch <darose@darose.net>wrote:

> On 04/29/2010 08:58 AM, Danny Leshem wrote:
>
>> Hello,
>>
>> I'm using Hadoop to run a memory-intensive job over different inputs.
>> The job requires the availability (in memory) of a read-only HashMap,
>> about 4 GB in size.
>> The same fixed HashMap is used for all inputs.
>>
>> I'm using a cluster of EC2 machines, each with more than enough memory
>> (around 7 GB) to hold a single instance of the HashMap in full.
>> The problem is that each MapReduce task runs in its own process, so the
>> HashMap is replicated once per concurrent task on the machine - not good!
>>
>> According to the link below, you can force Hadoop to run multiple tasks
>> (of the same job) in the same JVM:
>>
>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
>> This doesn't seem to work for me - I still see several Java processes
>> spawned.
>>
>> But even if it did work, running several jobs in parallel (say, on
>> different inputs) would still require the HashMap to be replicated!
>> Can one force Hadoop to run all jobs in the same JVM? (as opposed to
>> just all tasks of a given job)
>>
>> If not, what's the recommended paradigm for running multiple instances
>> of a job that requires large read-only structures in memory?
>>
>> Thanks!
>> Danny
>>
>
> You could serialize the map and output it to the DistributedCache.  Then it
> would be available for all map & reduce tasks to use.  (Though they'd have
> to read it back in and deserialize it first, of course.)
>
> HTH,
>
> DR
>
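The serialization side of David's suggestion might look roughly like the plain-Java sketch below. The `CacheMapDemo` class name is hypothetical, and the Hadoop plumbing (uploading the file to HDFS, registering it with `DistributedCache.addCacheFile`, and locating it in the task via `DistributedCache.getLocalCacheFiles`) is omitted:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class CacheMapDemo {

    // Run once, before job submission: serialize the lookup map to a file
    // that would then be pushed to the DistributedCache.
    public static void writeMap(HashMap<String, String> map, File f)
            throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(map);
        }
    }

    // Run in each task's setup/configure step: read the local cached copy
    // back into memory.
    @SuppressWarnings("unchecked")
    public static HashMap<String, String> readMap(File f)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(f))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> lookup = new HashMap<>();
        lookup.put("key", "value");

        File f = File.createTempFile("lookup", ".ser");
        writeMap(lookup, f);

        HashMap<String, String> copy = readMap(f);
        System.out.println(copy.get("key")); // prints "value"
    }
}
```

Note that this addresses distribution, not sharing: every task JVM that deserializes the file still holds its own in-memory copy of the map, which is exactly the replication Danny is trying to avoid.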
