hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aleksandar Stupar <stupar.aleksan...@yahoo.com>
Subject Re: Memory intensive jobs and JVM reuse
Date Fri, 30 Apr 2010 06:45:44 GMT

this may sound silly, but what I would try is following:

- use CombinedFileInputFormat with mapred.max.split.size set to 4xBlockSize (can be more than
  this will reduce the number of input splits and therefor number of map tasks so you can
do the following:
- set mapred.tasktracker.map.tasks.maximum to 1, meaning you will only have one JVM per machine
- and finally you use MultithreadedMapper to get everything parallel (off course be careful
with multithreading issues) 

and off course JVM reuse with mapred.job.reuse.jvm.num.tasks  set to -1.

Please let me know if you try this.

Kind regards,
Aleksandar Stupar.

From: Nick Jones <darellik@gmail.com>
To: mapreduce-user@hadoop.apache.org
Sent: Fri, April 30, 2010 4:31:57 AM
Subject: Re: Memory intensive jobs and JVM reuse

  On 4/29/2010 10:52 AM, Aaron Kimball wrote: 
* JVM reuse only applies within the same job. Different
>jobs are always different JVMs
>* JVM reuse is serial; you'll only get task B in a JVM after
>task A has already completed -- never both at the same time. If you
>configure Hadoop to run 4 tasks per TT concurrently, you'll still have
>4 JVMs up.
>There isn't an easy way to play tricks to get this large data
>structure of yours to be shared between tasks or jobs, using Hadoop's
>own mechanisms.  Instead, you need to look to an external system. e.g.,
>populate memcached with your hashmap ahead of time and have all your
>jobs query that. Or use a BDB or TokyoCabinet or some other disk-backed
>key-value database and have each task demand-load only the (k, v) pairs
>that are relevant to the particular task.
>- Aaron
>On Thu, Apr 29, 2010 at 8:43 AM, David
>Rosenstrauch <darose@darose.net> wrote:
>On 04/29/2010 11:08 AM, Danny Leshem wrote:
>>>>>>DistributedCache distributes files across the cluster - it is not
>>>>>>memory cache.
>>>>>>My problem is not distributing the HashMap across machines, but the
>>>>>>that it is replicated in memory for each task (or each job, for that
>>OK, sorry for the misunderstanding.
>>>>Hmmm ... well ... I thought there was a config parm to control whether
>>the task gets launched in a new VM, but I can't seem to find it.  A
>>quick look at the list of map/reduce parms turned up this, though:
>>>>mapred.job.reuse.jvm.num.tasks  default: 1      How many tasks to run
>>per jvm. If set to -1, there is no limit.
>>>>Perhaps that might help?  I'm speculating here, but in theory, if you
>>set it to -1, then all task attempts per job per node would run in the
>>same VM ... and so be able to have access to the same static variables.
>>>>You might also want to poke around in the full list of map/reduce
>>config parms and see if there's anything else in there that might help
>>solve this:
Couldn't the DistributedCache idea still work with a chained set of
jobs?  Map the first set into files on the DFS and add them to the DC
for the next time through?

Nick Jones

View raw message