hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Initialization costs
Date Mon, 30 Oct 2006 16:00:46 GMT
We have been doing some similar things, but we use custom MapRunner 
classes to load the resources once per Map split and then pass them 
into the Map tasks -- for example, files that need to be opened, or a 
shared cache to reduce network reads.  Here is an example of what it 
might look like:


public class YourRunner
  implements MapRunnable {

  private JobConf job;
  private YourMapper mapper;
  private Class inputKeyClass;
  private Class inputValueClass;

  public void configure(JobConf job) {
    this.job = job;
    this.inputKeyClass = job.getInputKeyClass();
    this.inputValueClass = job.getInputValueClass();
  }

  private void closeReaders(MapFile.Reader[] readers) {

    if (readers == null) return;
    for (int i = 0; i < readers.length; i++) {
      try {
        readers[i].close();
      }
      catch (Exception e) {
        // ignore errors while closing
      }
    }
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
    throws IOException {

    final FileSystem fs = FileSystem.get(job);

    Configuration conf = NutchConfiguration.create();
    mapper = new YourMapper();

    // open the shared resources once per split, before the record loop
    Path filesPath = new Path(parent, mapfiledir);
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs,
      filesPath, conf);
    Map<Integer, String> cache = new HashMap<Integer, String>();

    try {

      WritableComparable key = (WritableComparable)job.newInstance(inputKeyClass);
      Writable value = (Writable)job.newInstance(inputValueClass);

      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    }
    finally {
      // release the shared resources once the split is done
      closeReaders(readers);
    }
  }
}
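For completeness, a runner like this has to be registered on the job; with the old-style JobConf API that is done via setMapRunnerClass.  A minimal sketch of the job setup (the job name and the input/output paths here are placeholders, not something from the example above):

```java
// Sketch of wiring the custom runner into a job with the old mapred API.
// YourRunner and YourMapper are the classes from the example above.
JobConf job = new JobConf(NutchConfiguration.create());
job.setJobName("your-job");                 // placeholder name

job.setInputPath(new Path("input"));        // placeholder paths
job.setOutputPath(new Path("output"));

job.setMapperClass(YourMapper.class);
job.setMapRunnerClass(YourRunner.class);    // run() now controls per-split setup
job.setReducerClass(IdentityReducer.class);

JobClient.runJob(job);
```

With setMapRunnerClass in place, the framework calls YourRunner.run() for each split instead of driving the mapper itself, which is what lets the resource loading happen once per split rather than once per record.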

Grant Ingersoll wrote:
> I know in general that I shouldn't worry too much about initialization 
> costs, as they will be amortized over the life of the job and are 
> often a drop in the bucket time-wise.  However, in my setup I have a 
> conf() method that needs to load in some resources from disk.   This 
> is on a per job basis currently.  I know that each node in my cluster 
> is going to need these resources and every job I submit is going to 
> end up doing this same thing.  So I was wondering if there was any way 
> these resources could be loaded once per startup of the task tracker.  
> In some sense, this is akin to putting something into application 
> scope in a webapp as opposed to session scope.
> Thanks,
> Grant
