hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Initialization costs
Date Mon, 30 Oct 2006 16:00:46 GMT
We have been doing some similar things, but we use custom MapRunner
classes to load the resources (for example, files that need to be
opened, or a shared cache to reduce network reads) once per map split
and then pass the resources into the map tasks.  Here is an example of
what it might look like:

Dennis

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.util.NutchConfiguration;

public class YourRunner
  implements MapRunnable {

  private JobConf job;
  private YourMapper mapper;
  private Class inputKeyClass;
  private Class inputValueClass;

  public void configure(JobConf job) {
    this.job = job;
    this.inputKeyClass = job.getInputKeyClass();
    this.inputValueClass = job.getInputValueClass();
  }

  private void closeReaders(MapFile.Reader[] readers) {

    if (readers == null)
      return;
    for (int i = 0; i < readers.length; i++) {
      try {
        readers[i].close();
      }
      catch (Exception e) {
        // ignore failures while closing a reader
      }
    }
  }

  public void run(RecordReader input, OutputCollector output,
    Reporter reporter)
    throws IOException {

    final FileSystem fs = FileSystem.get(job);

    Configuration conf = NutchConfiguration.create();
    mapper = new YourMapper();

    // parent and mapfiledir are placeholders for wherever the
    // MapFile data to look up lives
    Path filesPath = new Path(parent, mapfiledir);
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs,
      filesPath, conf);
    Map<Integer, String> cache = new HashMap<Integer, String>();

    mapper.setCache(cache);
    mapper.setReaders(readers);

    try {

      WritableComparable key =
        (WritableComparable)job.newInstance(inputKeyClass);
      Writable value = (Writable)job.newInstance(inputValueClass);

      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    }
    finally {
      mapper.close();
      closeReaders(readers);
    }
  }
}
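
To wire this in, the job just needs to point at the custom runner
instead of the default MapRunner.  A rough sketch of the job setup
(the job name and the mapper/reducer classes are only placeholders):

JobConf job = new JobConf(NutchConfiguration.create());
job.setJobName("your-job");
job.setMapperClass(YourMapper.class);
// drive the map phase with the custom runner above
job.setMapRunnerClass(YourRunner.class);
job.setReducerClass(YourReducer.class);
JobClient.runJob(job);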


Grant Ingersoll wrote:
> I know in general that I shouldn't worry too much about initialization 
> costs, as they will be amortized over the life of the job and are 
> often a drop in the bucket time wise.  However, in my setup I have a 
> conf() method that needs to load in some resources from disk.   This 
> is on a per job basis currently.  I know that each node in my cluster 
> is going to need these resources and every job I submit is going to 
> end up doing this same thing.  So I was wondering if there was any way
> these resources could be loaded once per startup of the task tracker.  
> In some sense, this is akin to putting something into application 
> scope in a webapp as opposed to session scope.
>
> Thanks,
> Grant
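
On the original question of loading once per startup of the task
tracker rather than once per job: I'm not aware of an "application
scope" hook in the task lifecycle, but one approximation (it only pays
off when several tasks run in the same JVM) is to keep the loaded
resources in a lazily initialized static holder, so each JVM reads
them from disk at most once.  A minimal sketch, where ResourceCache
and loadResources are made-up names:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.mapred.JobConf;

public class ResourceCache {

  // loaded at most once per JVM; every task in that JVM reuses it
  private static Map<String, String> resources;

  public static synchronized Map<String, String> get(JobConf job)
    throws IOException {
    if (resources == null) {
      resources = loadResources(job);
    }
    return resources;
  }

  // made-up loader: open whatever files the job needs and cache them
  private static Map<String, String> loadResources(JobConf job)
    throws IOException {
    Map<String, String> map = new HashMap<String, String>();
    // ... e.g. read from FileSystem.get(job) and populate the map
    return map;
  }
}

A mapper's configure(JobConf) could then call ResourceCache.get(job)
instead of re-reading the files itself.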
