From: Scott Carey
To: "core-user@hadoop.apache.org"
Date: Sun, 1 Mar 2009 11:21:11 -0800
Subject: RE: MapReduce jobs with expensive initialization
In-Reply-To: <4af5cd780902280606y2a08a6c6ie1e0a583f164ddeb@mail.gmail.com>

You could create a singleton class and reference the dictionary stuff in that. You would probably want this separate from other classes, so that you control exactly what data is held on to for a long time and what is not.

    class Singleton {

      private static final Singleton _instance = new Singleton();

      private Singleton() {
        // ... initialize here; only ever called once per classloader or JVM
      }

      public static Singleton getSingleton() {
        return _instance;
      }
    }

in mapper:

    Singleton dictionary = Singleton.getSingleton();

This assumes that each mapper doesn't live in its own classloader space (which would make even static singletons not shareable), and has the drawback that once initialized, the memory associated with the singleton won't go away until the JVM or classloader that hosts it dies.

I have not tried this myself, and do not know the exact classloader semantics used in the new 'persistent' task JVMs. They could have a classloader per job and dispose of it when the job is complete, in which case data could persist within a job but not across jobs. Or there could be one permanent persisted classloader, or one per task. All will behave differently with respect to statics like the above example.

________________________________________
From: Stuart White [stuart.white1@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm with -1 to force mapreduce to reuse JVMs when executing tasks.

How I've been performing my initialization is, in my mapper, I override MapReduceBase#configure, read my params from the JobConf, and load my dictionaries.
It appears, from the tests I've run, that even though NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class are being created for each task, and therefore I'm still re-running this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive initialization per-task?

Should I move my initialization code out of my mapper class and into my "main" class? If so, how do I pass references to the loaded dictionaries from my main class to my mapper?

Thanks!
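
For reference, here is a slightly fuller sketch of the singleton idea from the reply at the top of this message. It is untested against Hadoop and deliberately avoids Hadoop classes so it runs on its own: DictionaryHolder, FakeMapper, SingletonDemo, and the LOAD_COUNT counter are all hypothetical names invented for illustration, and FakeMapper.configure() only stands in for MapReduceBase#configure. It uses the lazy holder idiom, so the expensive load happens on first use rather than at class load, and it shares the caveat above: the instance is shared only among mappers loaded by the same classloader in the same JVM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the expensive dictionaries; not part of Hadoop.
final class DictionaryHolder {
    // Counts how often the expensive load actually ran (demonstration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private final Map<String, String> dictionary = new HashMap<>();

    private DictionaryHolder() {
        LOAD_COUNT.incrementAndGet();
        // Stand-in for the real, expensive dictionary load.
        dictionary.put("hadoop", "elephant");
    }

    // The JVM initializes this nested class lazily and thread-safely,
    // the first time getInstance() is called.
    private static final class Lazy {
        static final DictionaryHolder INSTANCE = new DictionaryHolder();
    }

    static DictionaryHolder getInstance() {
        return Lazy.INSTANCE;
    }

    String lookup(String key) {
        return dictionary.get(key);
    }
}

// Stand-in for a Mapper; a real one would extend MapReduceBase.
final class FakeMapper {
    private DictionaryHolder dictionary;

    // Plays the role of configure(JobConf), which runs once per task.
    void configure() {
        dictionary = DictionaryHolder.getInstance();
    }

    String map(String key) {
        return dictionary.lookup(key);
    }
}

final class SingletonDemo {
    public static void main(String[] args) {
        // Simulate three tasks running in one reused JVM.
        for (int task = 0; task < 3; task++) {
            FakeMapper mapper = new FakeMapper(); // new instance per task
            mapper.configure();
            mapper.map("hadoop");
        }
        System.out.println("loads: " + DictionaryHolder.LOAD_COUNT.get());
    }
}
```

If the task JVM really does keep one classloader across tasks, each new mapper instance would pick up the already-loaded holder in configure() instead of reloading. If each task gets its own classloader, the load would run once per task again.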