hadoop-common-user mailing list archives

From: Doug Cutting <cutt...@apache.org>
Subject: Re: Serializing code to nodes: no can do?
Date: Tue, 24 Apr 2007 16:45:19 GMT
Pedro Guedes wrote:
> For this I need to be able to register new steps in my chain and pass
> them to Hadoop to execute as a MapReduce job. I see two choices here:
> 1 - build a .job archive (main-class: mycrawler, submits jobs through
> JobClient) with my new steps and dependencies in the 'lib/' directory,
> include my 'crawling-chain.xml' in the .job (to pass the configuration
> of the chain to my crawler nodes), and then run it with the RunJar
> utility through a new Thread (so that I have a clean classpath, right?).
> 2 - in a new thread, configure my classpath to include the classes
> needed by the crawling chain, write my crawler chain to HDFS so that the
> nodes can read it on job execution, and then submit jobs through
> JobClient. When starting the MapReduce job, the nodes would first read
> the crawling chain from HDFS and then execute it in map or reduce.
> Have I got it right? Which one sounds better?

If I understand your question, I think (1) is preferable.  The MapReduce 
system copies the job jar into HDFS and then reads it back on the nodes 
when running tasks.  This is optimized in several ways (file replication, 
caching, etc.) and is thus probably superior to implementing something 
similar yourself, as described in (2).
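
For (1), a minimal driver sketch along these lines should work, using the 
org.apache.hadoop.mapred API.  The job name, paths, and the identity 
map/reduce classes are just placeholders standing in for your real crawl 
steps packaged under 'lib/':

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MyCrawler {
      public static void main(String[] args) throws Exception {
        // Passing the driver class lets Hadoop locate the enclosing .job jar.
        JobConf conf = new JobConf(MyCrawler.class);
        conf.setJobName("crawl-step");
        // Identity map/reduce as stand-ins for real crawl-step classes
        // shipped in the .job archive's lib/ directory.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submits the job; the framework ships the jar into HDFS for tasks.
        JobClient.runJob(conf);
      }
    }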

> Another question: if I use setJar() on the JobConf, will Hadoop
> include 'lib/*.jar' in the job.jar it sends to the nodes?

Hadoop sends the entire job jar file.  Jobs are run connected to a 
directory where the job jar has been unpacked.  The classpath contains 
two directories from the jar, the top-level directory and the 'classes/' 
directory, plus all jar files contained in the 'lib/' directory.
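
So, since the top-level directory of the unpacked jar is on the task 
classpath, something like the following should work inside a map or reduce 
task for reading your chain configuration, assuming you place 
crawling-chain.xml at the root of the .job archive:

    import java.io.InputStream;

    public class ChainLoader {
      public static InputStream openChainConfig() {
        // Resolved against the unpacked job jar's top-level directory,
        // which is on the task classpath.
        return ChainLoader.class.getResourceAsStream("/crawling-chain.xml");
      }
    }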

