hadoop-mapreduce-user mailing list archives

From Danny Leshem <dles...@gmail.com>
Subject Memory intensive jobs and JVM reuse
Date Thu, 29 Apr 2010 12:58:41 GMT
Hello,

I'm using Hadoop to run a memory-intensive job over many different inputs.
The job requires a large read-only HashMap, about 4Gb in size, to be available
in memory, and the same fixed HashMap is used for every input.

I'm using a cluster of EC2 machines, each with more than enough memory (around 7Gb)
to hold a single instance of the HashMap in full.
The problem is that each MapReduce task runs in its own process, so the
HashMap is replicated once per concurrent task on a machine - with two map
slots, that's 8Gb of copies on a 7Gb box. Not good!

According to the following link, you can force Hadoop to run multiple tasks (of
the same job) in the same JVM:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
This doesn't seem to work for me - I still see several Java processes being
spawned.
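
For concreteness, this is roughly how I understand JVM reuse is supposed to be
enabled - a minimal sketch, assuming JobConf.setNumTasksToExecutePerJvm (backed
by the mapred.job.reuse.jvm.num.tasks property) is the right knob, and with
MyJob as a placeholder class:

    // Sketch: ask Hadoop to reuse one JVM for an unlimited number of tasks
    // of this job (-1 = no limit; the default of 1 means no reuse).
    JobConf conf = new JobConf(MyJob.class);
    conf.setNumTasksToExecutePerJvm(-1);
    // equivalently: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);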

But even if it did work, running several jobs in parallel (say, on different
datasets) would still require the HashMap to be replicated!
Can one force Hadoop to run all jobs in the same JVM, as opposed to just
all tasks of a given job?

If not, what's the recommended paradigm for running multiple instances of a
job that requires large read-only structures in memory?
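
To make the pattern I'm after concrete, here is a minimal sketch of what I'd
like to do, assuming JVM reuse were working: load the HashMap lazily into a
static field, so every task that reuses the JVM shares the single copy. The
class and method names (DictionaryMapper, loadDictionary) are made up for
illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class DictionaryMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Loaded at most once per JVM; later tasks that reuse this JVM
      // see the same instance instead of building their own 4Gb copy.
      private static Map<String, String> dictionary;

      private static synchronized Map<String, String> getDictionary()
          throws IOException {
        if (dictionary == null) {
          dictionary = loadDictionary();
        }
        return dictionary;
      }

      private static Map<String, String> loadDictionary() throws IOException {
        // Placeholder for the real loading code (e.g. reading a file
        // shipped via the DistributedCache).
        return new HashMap<String, String>();
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        Map<String, String> dict = getDictionary();
        // ... use dict to process the record ...
      }
    }

But as far as I can tell this only helps within one job's tasks, not across
jobs, which is why I'm asking about the recommended approach.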

Thanks!
Danny
