hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject RE: MapReduce jobs with expensive initialization
Date Sun, 01 Mar 2009 19:21:11 GMT
You could create a singleton class and reference the dictionaries through it.  You would probably
want to keep this separate from your other classes so that you control exactly which data is held
for a long time and which is not.

class Singleton {

  private static final Singleton _instance = new Singleton();

  private Singleton() {
    // ... initialize here; only ever called once per classloader or JVM
  }

  public static Singleton getSingleton() {
    return _instance;
  }
}

in mapper:

Singleton dictionary = Singleton.getSingleton();

This assumes that each mapper doesn't live in its own classloader space (which would make
even static singletons not shareable), and has the drawback that, once initialized, the memory
associated with the singleton won't go away until the JVM or classloader that hosts it dies.

I have not tried this myself, and do not know the exact classloader semantics used in the
new 'persistent' task JVMs.  They could have a classloader per job and dispose of it when
the job is complete, in which case data could persist within a job but not across jobs.
Or there could be one permanent classloader, or one per task.  Each of these will
behave differently with respect to statics like the above example.
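
If you want the load to happen on first use rather than whenever the class happens to be
initialized, a lazy holder variant of the singleton above might look roughly like the sketch
below.  This is untested against Hadoop itself; DictionaryHolder, FakeMapper and LOAD_COUNT
are made-up names for illustration, FakeMapper merely stands in for a real Mapper whose
configure() is called once per task, and the same classloader caveats apply.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the expensive dictionaries; not part of Hadoop.
final class DictionaryHolder {

    // Counts how often the expensive load ran (illustration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private final Map<String, String> dictionary = new HashMap<String, String>();

    private DictionaryHolder() {
        LOAD_COUNT.incrementAndGet();
        // Stand-in for the real, expensive dictionary load.
        dictionary.put("hadoop", "elephant");
    }

    // The JVM initializes this nested class on the first call to getInstance(),
    // and class initialization is thread-safe, so the load runs at most once
    // per classloader.
    private static final class Lazy {
        static final DictionaryHolder INSTANCE = new DictionaryHolder();
    }

    static DictionaryHolder getInstance() {
        return Lazy.INSTANCE;
    }

    String lookup(String key) {
        return dictionary.get(key);
    }
}

// Stand-in for a mapper: a new instance is created for every task, but each
// one picks up the same shared holder in configure().
final class FakeMapper {

    private DictionaryHolder dictionary;

    void configure() {
        dictionary = DictionaryHolder.getInstance();
    }

    String map(String key) {
        return dictionary.lookup(key);
    }
}
```

Creating several FakeMapper instances in one JVM and calling configure() on each should trigger
the load only once, provided they all share a classloader.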

From: Stuart White [stuart.white1@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm(-1) to
force mapreduce to reuse JVMs when executing tasks.

So far I've been performing my initialization in my mapper: I
override MapReduceBase#configure, read my parameters from the JobConf,
and load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my "main" class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?
