hadoop-mapreduce-user mailing list archives

From Nick Jones <darel...@gmail.com>
Subject Re: Memory intensive jobs and JVM reuse
Date Fri, 30 Apr 2010 02:31:57 GMT
On 4/29/2010 10:52 AM, Aaron Kimball wrote:
> * JVM reuse only applies within the same job. Different jobs always
> run in different JVMs.
> * JVM reuse is serial; you'll only get task B in a JVM after task A 
> has already completed -- never both at the same time. If you configure 
> Hadoop to run 4 tasks per TT concurrently, you'll still have 4 JVMs up.
> There isn't an easy way to play tricks to get this large data 
> structure of yours to be shared between tasks or jobs, using Hadoop's 
> own mechanisms.  Instead, you need to look to an external system. 
> e.g., populate memcached with your hashmap ahead of time and have all 
> your jobs query that. Or use a BDB or TokyoCabinet or some other 
> disk-backed key-value database and have each task demand-load only the 
> (k, v) pairs that are relevant to the particular task.
> - Aaron
> On Thu, Apr 29, 2010 at 8:43 AM, David Rosenstrauch
> <darose@darose.net> wrote:
>     On 04/29/2010 11:08 AM, Danny Leshem wrote:
>         David,
>         DistributedCache distributes files across the cluster - it is
>         not a shared
>         memory cache.
>         My problem is not distributing the HashMap across machines,
>         but the fact
>         that it is replicated in memory for each task (or each job,
>         for that
>         matter).
>     OK, sorry for the misunderstanding.
>     Hmmm ... well ... I thought there was a config parm to control
>     whether the task gets launched in a new VM, but I can't seem to
>     find it.  A quick look at the list of map/reduce parms turned up
>     this, though:
>     mapred.job.reuse.jvm.num.tasks
>       default: 1
>       How many tasks to run per jvm. If set to -1, there is no limit.
>     Perhaps that might help?  I'm speculating here, but in theory, if
>     you set it to -1, then all task attempts per job per node would
>     run in the same VM ... and so be able to have access to the same
>     static variables.
>     You might also want to poke around in the full list of map/reduce
>     config parms and see if there's anything else in there that might
>     help solve this:
>     http://hadoop.apache.org/common/docs/current/mapred-default.html
>     HTH,
>     DR
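
To make David's speculation concrete, here is a minimal sketch in plain Java (no Hadoop classes) of a static, lazily loaded table. The names SharedLookup, FakeTask and JvmReuseSketch are made up for illustration, and load() is a stand-in for whatever actually builds the HashMap. The assumption is that mapred.job.reuse.jvm.num.tasks is set to -1, so successive tasks of the same job run in the same JVM and see the same static field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder: builds the table once per JVM, on first use. */
final class SharedLookup {
    // Counts how often the expensive load ran (demonstration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    private static Map<String, Integer> table; // guarded by the class lock

    private SharedLookup() {}

    static synchronized Map<String, Integer> get() {
        if (table == null) {
            table = load();
        }
        return table;
    }

    // Stand-in for the real, expensive load (e.g. reading a local file).
    private static Map<String, Integer> load() {
        LOADS.incrementAndGet();
        Map<String, Integer> m = new HashMap<>();
        m.put("alpha", 1);
        m.put("beta", 2);
        return m;
    }
}

/** Stand-in for a map task; a real task would do this in its setup code. */
final class FakeTask {
    int lookup(String key) {
        Integer v = SharedLookup.get().get(key);
        return v == null ? -1 : v;
    }
}

final class JvmReuseSketch {
    public static void main(String[] args) {
        // Two "tasks" running one after the other in the same JVM.
        FakeTask first = new FakeTask();
        FakeTask second = new FakeTask();
        System.out.println(first.lookup("alpha"));
        System.out.println(second.lookup("beta"));
        System.out.println("loads: " + SharedLookup.LOADS.get());
    }
}
```

Per Aaron's points above, this only saves reloading between successive tasks of one job. With 4 concurrent task slots there are still 4 JVMs, each holding its own copy of the table.
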
Couldn't the DistributedCache idea still work with a chained set of
jobs?  Have the first job write the map out as files on the DFS, then
add those files to the DistributedCache for the next job in the chain?
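
A rough sketch of the file round trip I have in mind, in plain Java with no Hadoop classes. SideFile and its methods are hypothetical names; the tab-separated "key<TAB>value" layout is an assumption (keys must not contain tabs), and in a real chain the file would live on the DFS and reach each node through the DistributedCache rather than a local temp file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: what job 1 would write and job 2 would read back. */
final class SideFile {
    private SideFile() {}

    // Job 1 side: one "key<TAB>value" line per entry.
    static void write(Path file, Map<String, String> pairs) {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, String> e : pairs.entrySet()) {
                out.write(e.getKey());
                out.write('\t');
                out.write(e.getValue());
                out.newLine();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Job 2 side: rebuild the map from the local copy of the file.
    static Map<String, String> load(Path file) {
        Map<String, String> pairs = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a separator
                }
                pairs.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return pairs;
    }

    // Convenience for trying it out: write to a temp file, read it back.
    static Map<String, String> roundTrip(Map<String, String> pairs) {
        try {
            Path tmp = Files.createTempFile("sidefile", ".tsv");
            try {
                write(tmp, pairs);
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

This still leaves one in-memory copy per task JVM, so it covers getting the data to the next job more than the duplication Danny describes.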

Nick Jones
