From: Scott Carey <scott@richrelevance.com>
To: "core-user@hadoop.apache.org" <core-user@hadoop.apache.org>
Date: Sun, 1 Mar 2009 11:21:11 -0800
Subject: RE: MapReduce jobs with expensive initialization
Message-ID: 
 <BDFBB77C9E07BE4A984DAAE981D19F961AE363D98B@EXVMBX018-1.exch018.msoutlookonline.net>
References: <4af5cd780902280606y2a08a6c6ie1e0a583f164ddeb@mail.gmail.com>
In-Reply-To: <4af5cd780902280606y2a08a6c6ie1e0a583f164ddeb@mail.gmail.com>

You could create a singleton class and hold the dictionaries in it.  You
would probably want to keep it separate from your other classes, so that
you control exactly which data is held on to for a long time and which is
not.

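class Singleton {

    // Created once, when the class is initialized by its classloader.
    private static final Singleton _instance = new Singleton();

    private Singleton() {
        // ... initialize here; only ever called once per classloader or JVM
    }

    // Must be static so callers can reach the instance without creating one.
    public static Singleton getSingleton() {
        return _instance;
    }
}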

In the mapper:

Singleton dictionary = Singleton.getSingleton();
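
If the dictionaries depend on parameters read from the JobConf (a file
path, for example), an eager static initializer has nothing to read them
from. Below is a minimal sketch of a lazily initialized variant. The names
DictionaryCache, get and getLoadCount are made up for illustration and are
not part of Hadoop, and load() is only a stand-in for the real, expensive
loading code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder class; the name and API are illustrative only.
final class DictionaryCache {

    // Shared by every task that runs inside the same classloader.
    private static Map<String, String> dictionary;

    // Counts real loads so that reuse can be verified.
    private static int loadCount = 0;

    private DictionaryCache() {}

    // synchronized so that two concurrent callers cannot both load.
    static synchronized Map<String, String> get(String path) {
        if (dictionary == null) {
            dictionary = Collections.unmodifiableMap(load(path));
            loadCount++;
        }
        return dictionary;
    }

    static synchronized int getLoadCount() {
        return loadCount;
    }

    // Stand-in for the expensive work; a real version would read 'path'.
    private static Map<String, String> load(String path) {
        Map<String, String> m = new HashMap<String, String>();
        m.put("source", path);
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> first = get("/hypothetical/dict.txt");
        Map<String, String> second = get("/hypothetical/dict.txt");
        System.out.println("same instance: " + (first == second));
        System.out.println("loads: " + getLoadCount());
    }
}
```

In a mapper, the call to DictionaryCache.get(...) would presumably go in
configure(JobConf), with the path taken from the JobConf. Note that this
sketch ignores the path on every call after the first, so it only suits
jobs that always load the same dictionary.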

This assumes that each mapper doesn't live in its own classloader space
(which would make even static singletons not shareable), and has the
drawback that, once initialized, the memory associated with the singleton
won't go away until the JVM or classloader that hosts it dies.

I have not tried this myself, and do not know the exact classloader
semantics used in the new 'persistent' task JVMs.  They could use one
classloader per job and dispose of it when the job is complete, in which
case data could persist within a job but not across jobs.  Or there could
be one permanent classloader, or one per task.  Each of these would behave
differently with respect to statics like the one in the example above.
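
Since the behaviour depends on which of these the task JVM does, one way
to find out is to measure it. The sketch below is a hypothetical
diagnostic helper (ReuseProbe and report are made-up names, not Hadoop
API) that counts calls in a static field and reports the identity of the
classloader that loaded it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical diagnostic helper; not part of Hadoop.
final class ReuseProbe {

    // Lives only as long as the classloader that loaded ReuseProbe.
    private static final AtomicInteger CALLS = new AtomicInteger();

    private ReuseProbe() {}

    // Returns a one-line report; intended to be called once per task.
    static String report() {
        int n = CALLS.incrementAndGet();
        ClassLoader cl = ReuseProbe.class.getClassLoader();
        return "call #" + n + " in classloader@"
                + Integer.toHexString(System.identityHashCode(cl));
    }

    public static void main(String[] args) {
        System.out.println(report());
        System.out.println(report());
    }
}
```

If each task logged ReuseProbe.report() from its mapper, a call number
that keeps rising with an unchanged classloader id would suggest that
statics are shared between tasks. A call number that stays at 1 would
suggest a fresh classloader or a fresh JVM for each task.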

________________________________________
From: Stuart White [stuart.white1@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm(-1) to
force mapreduce to reuse JVMs when executing tasks.

So far I've been performing my initialization in my mapper: I
override MapReduceBase#configure, read my parameters from the JobConf,
and load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my "main" class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?

Thanks!