hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: MapReduce jobs with expensive initialization
Date Mon, 02 Mar 2009 11:03:49 GMT
On any particular tasktracker slot, task JVMs are shared only between
tasks of the same job. When the job is complete the task JVM will go
away. So there is certainly no sharing between jobs.

I believe the static singleton approach outlined by Scott will work
since the map classes are in a single classloader (but I haven't
actually tried this).
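
To make the idea concrete, here is a minimal, untested-on-a-cluster sketch of the static holder approach. It assumes, as discussed above, that all tasks of a job running in a reused JVM share one classloader. DictionaryHolder, SketchMapper, LOAD_COUNT and the sample dictionary entry are hypothetical names invented for illustration; SketchMapper only stands in for a real Mapper whose configure() would read its parameters from the JobConf.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the expensive dictionaries (illustrative name).
final class DictionaryHolder {
    // Counts how often the expensive load ran; only here to demonstrate sharing.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private final Map<String, String> dictionary = new HashMap<>();

    private DictionaryHolder() {
        // Stand-in for the expensive dictionary load.
        LOAD_COUNT.incrementAndGet();
        dictionary.put("hadoop", "elephant");
    }

    // Nested class is initialized on first use, once per classloader,
    // and class initialization is thread-safe.
    private static final class Lazy {
        static final DictionaryHolder INSTANCE = new DictionaryHolder();
    }

    static DictionaryHolder getInstance() {
        return Lazy.INSTANCE;
    }

    String lookup(String key) {
        return dictionary.get(key);
    }
}

// Stand-in for a Mapper; a new instance is created for every task.
final class SketchMapper {
    private DictionaryHolder dictionary;

    // Mirrors MapReduceBase#configure: cheap after the first task in the JVM.
    void configure() {
        dictionary = DictionaryHolder.getInstance();
    }

    String map(String key) {
        return dictionary.lookup(key);
    }
}
```

Each new SketchMapper calls configure(), but only the first call in a given classloader pays for the load. If the task JVM used a separate classloader per task, every task would load the dictionaries again, which is the caveat Scott raises below.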


On Mon, Mar 2, 2009 at 1:39 AM, jason hadoop <jason.hadoop@gmail.com> wrote:
> If you have to, you can reach through all of the class loaders and find the
> instance of your singleton class that has the data loaded. It is awkward, and
> I haven't done this in Java since the late 90's. It did work the last time I
> did it.
> On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey <scott@richrelevance.com> wrote:
>> You could create a singleton class and reference the dictionary stuff in
>> that.  You would probably want this separate from other classes so as to
>> control exactly what data is held on to for a long time and what is not.
>> class Singleton {
>>   // Created once, when the class is first initialized by its classloader.
>>   private static final Singleton _instance = new Singleton();
>>   private Singleton() {
>>     // ... initialize here; only ever called once per classloader or JVM
>>   }
>>   public static Singleton getSingleton() {
>>     return _instance;
>>   }
>> }
>> in mapper:
>> Singleton dictionary = Singleton.getSingleton();
>> This assumes that each mapper doesn't live in its own classloader space
>> (which would make even static singletons not shareable), and has the
>> drawback that once initialized, that memory associated with the singleton
>> won't go away until the JVM or classloader that hosts it dies.
>> I have not tried this myself, and do not know the exact classloader
>> semantics used in the new 'persistent' task JVMs.  They could have a
>> classloader per job, and dispose of those when the job is complete, though
>> then data could persist only within a job, not across jobs.  Or
>> there could be one permanent persisted classloader, or one per task.   All
>> will behave differently with respect to statics like the above example.
>> ________________________________________
>> From: Stuart White [stuart.white1@gmail.com]
>> Sent: Saturday, February 28, 2009 6:06 AM
>> To: core-user@hadoop.apache.org
>> Subject: MapReduce jobs with expensive initialization
>> I have a mapreduce job that requires expensive initialization (loading
>> of some large dictionaries before processing).
>> I want to avoid executing this initialization more than necessary.
>> I understand that I need to call setNumTasksToExecutePerJvm(-1) to
>> force mapreduce to reuse JVMs when executing tasks.
>> So far I have been performing my initialization in my mapper: I
>> override MapReduceBase#configure, read my parms from the JobConf, and
>> load my dictionaries.
>> It appears, from the tests I've run, that even though
>> NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
>> are being created for each task, and therefore I'm still re-running
>> this expensive initialization for each task.
>> So, my question is: how can I avoid re-executing this expensive
>> initialization per-task?  Should I move my initialization code out of
>> my mapper class and into my "main" class?  If so, how do I pass
>> references to the loaded dictionaries from my main class to my mapper?
>> Thanks!
