hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: Memory intensive jobs and JVM reuse
Date Thu, 29 Apr 2010 13:57:12 GMT
On 04/29/2010 08:58 AM, Danny Leshem wrote:
> Hello,
>
> I'm using Hadoop to run a memory-intensive job on different input data.
> The job requires the availability (in memory) of a read-only HashMap,
> about 4GB in size.
> The same fixed HashMap is used for all input data.
>
> I'm using a cluster of EC2 machines with more than enough memory (around 7GB
> each) to hold a single instance of the HashMap in full.
> The problem is that each MapReduce task runs in its own process, so the
> HashMap is replicated once for every concurrent task on the machine - not good!
>
> According to the following link, you can force Hadoop to run multiple tasks
> (of the same job) in the same JVM:
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
> This doesn't seem to work for me - I still see several Java processes
> spawned.
>
> But even if it did work, running several jobs in parallel (say, on different
> data) would still require the HashMap to be replicated!
> Can one force Hadoop to run all jobs in the same JVM? (as opposed to just
> all tasks of a given job)
>
> If not, what's the recommended paradigm for running multiple instances of a
> job that requires large read-only structures in memory?
>
> Thanks!
> Danny

You could serialize the map to a file and distribute it via the 
DistributedCache.  It would then be available locally to every map & reduce 
task.  (Though each task would have to read it back in and deserialize it 
first, of course.)
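
Something along these lines might work - just a rough, untested sketch using 
the 0.20 mapreduce API; the cache file name "hashmap.ser", the String 
key/value types, and plain Java serialization are only placeholders for 
illustration, not anything specific to your setup:

// Driver side, before submitting the job (assumes the HashMap has already
// been serialized to /cache/hashmap.ser on HDFS):
//   DistributedCache.addCacheFile(new URI("/cache/hashmap.ser"),
//                                 job.getConfiguration());

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.HashMap;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private HashMap<String, String> lookup;

    @Override
    @SuppressWarnings("unchecked")
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // The cached file has been copied to the task's local disk.  Note that
        // each task JVM still deserializes its own copy into its own heap.
        Path[] cached =
            DistributedCache.getLocalCacheFiles(context.getConfiguration());
        ObjectInputStream in =
            new ObjectInputStream(new FileInputStream(cached[0].toString()));
        try {
            lookup = (HashMap<String, String>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Look each input record up in the shared read-only map.
        String hit = lookup.get(value.toString());
        if (hit != null) {
            context.write(value, new Text(hit));
        }
    }
}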

HTH,

DR
