hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Initialization costs
Date Mon, 30 Oct 2006 17:48:37 GMT
A new JVM is used per task, so the resources will need to be re-read per task.

The job jar is cached locally by the tasktracker, so it is only copied 
from DFS to the local disk once per job.  Its contents are shared by all 
tasks in that job, so you can include shared files in the job jar.  Tasks 
are run with their working directory containing the unpacked jar contents.
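For instance, here is a minimal sketch of reading such a bundled file from 
inside a task; the path resources/stopwords.txt is just a hypothetical 
example of something packed into the job jar.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public class JarResourceExample {

        // Read the file off the classpath, since the jar contents are on
        // the task's classpath.
        public static String firstLineFromClasspath() throws IOException {
            InputStream in = JarResourceExample.class.getClassLoader()
                    .getResourceAsStream("resources/stopwords.txt");
            BufferedReader r = new BufferedReader(new InputStreamReader(in));
            try {
                return r.readLine();
            } finally {
                r.close();
            }
        }

        // Or read it as a plain file, since the task's working directory
        // contains the unpacked jar contents.
        public static String firstLineFromWorkingDir() throws IOException {
            BufferedReader r =
                    new BufferedReader(new FileReader("resources/stopwords.txt"));
            try {
                return r.readLine();
            } finally {
                r.close();
            }
        }
    }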

You can also attach other files to your job that should be cached on the 
task nodes, using the DistributedCache API:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/filecache/DistributedCache.html
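As a rough sketch of how that is typically used (assuming methods like 
addCacheFile and getLocalCacheFiles as in the Javadoc above; the DFS path 
/user/grant/lexicon.dat is a made-up example):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetupExample {

        // When configuring the job, register a file already in DFS so it
        // gets copied to local disk on each tasktracker node.
        public static void addLexicon(JobConf job) throws Exception {
            DistributedCache.addCacheFile(new URI("/user/grant/lexicon.dat"), job);
        }

        // Inside a task (e.g. in a configure(JobConf) method), look up the
        // local copies of the cached files.
        public static Path[] localCopies(JobConf job) throws Exception {
            return DistributedCache.getLocalCacheFiles(job);
        }
    }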

But including things in the job jar is far simpler.

Doug

Grant Ingersoll wrote:
> I know in general that I shouldn't worry too much about initialization 
> costs, as they will be amortized over the life of the job and are often 
> a drop in the bucket time-wise.  However, in my setup I have a conf() 
> method that needs to load some resources from disk.  This is currently 
> done on a per-job basis.  I know that each node in my cluster is going 
> to need these resources, and every job I submit is going to end up doing 
> this same thing.  So I was wondering if there was any way these resources 
> could be loaded once per startup of the task tracker.  In some sense, 
> this is akin to putting something into application scope in a webapp as 
> opposed to session scope.
> 
> Thanks,
> Grant
