nutch-commits mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "Getting Started" by SteveSeverance
Date Tue, 13 Mar 2007 22:22:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by SteveSeverance:

New page:
This page is a collection of information that is useful for new developers. Some of it will
eventually need to be moved to the Hadoop Wiki, but I am putting it here first as I assemble
it. Please feel free to add, comment, and make corrections.


To new developers: If you want to start developing on Nutch, do not forget to also look
at the Hadoop source code. Hadoop is the platform that Nutch is implemented on, so in order to
understand how Nutch works you also need to understand Hadoop.

=== What are the Hadoop primitives and how do I use them? Why are they there (what functionality
do they add over regular primitives)? ===

These primitives implement the Hadoop Writable interface (or WritableComparable). This
gives Hadoop control over how these objects are serialized. If you look at the higher-level
Hadoop file system objects such as ArrayFile, you will see that they implement the same
interfaces for serialization. Using these primitive types allows serialization to be done
in the same way as for higher-order data structures such as MapFile.
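To make the pattern concrete, here is a minimal self-contained sketch of the Writable idea. Note that the interface and class below are stand-ins written for illustration, not the real org.apache.hadoop.io classes: the point is that the object itself controls how its fields are written to and read from a binary stream, which is exactly what lets Hadoop serialize primitives and container files the same way.

```java
import java.io.*;

// Stand-in for Hadoop's Writable interface (illustrative, not the real API):
// an object serializes itself to a DataOutput and restores itself from a DataInput.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Analogous to Hadoop's IntWritable: a boxed int that knows its own
// compact binary representation.
class IntWritable implements Writable {
    private int value;
    IntWritable() {}
    IntWritable(int value) { this.value = value; }
    int get() { return value; }
    public void write(DataOutput out) throws IOException { out.writeInt(value); }
    public void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

public class WritableDemo {
    // Serialize a value and deserialize it again, mirroring what Hadoop
    // does when writing records into a file on disk.
    static int roundTrip(int v) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new IntWritable(v).write(new DataOutputStream(buf));
        IntWritable back = new IntWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return back.get();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(12345)); // prints 12345
    }
}
```

Because higher-level structures like ArrayFile and MapFile are built from entries that implement this same contract, any type you write this way can be stored in them without special handling.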

=== How does the Hadoop implementation of MapReduce work? ===

 1. First you need a JobConf. This class contains all the relevant information for the job.
Information you need to make sure you include in the JobConf:
 2. Then you need to submit your job to Hadoop to be run. This is done by calling JobClient.runJob.
JobClient.runJob submits the job for execution and handles receiving status updates back from
the job. It starts by creating an instance of JobClient, then pushes the job
toward execution by calling JobClient.submitJob.
 3. JobClient.submitJob handles splitting the input files and generating the MapReduce tasks.
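The data flow that a submitted job ultimately performs can be sketched in plain Java without any Hadoop dependencies. The word-count example below is only an in-memory illustration of the map, group-by-key, and reduce phases; in real Hadoop the JobConf names the Mapper and Reducer classes and JobClient.runJob drives these phases across the input splits on the cluster.

```java
import java.util.*;

// Toy in-memory sketch of the MapReduce data flow: map -> shuffle -> reduce.
// This is NOT the Hadoop API, just an illustration of the three phases.
public class MiniMapReduce {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word in every line.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    emitted.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle phase: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b a")));
        // prints {a=3, b=2}
    }
}
```

Step 3 above is where Hadoop decides how many map tasks to create: each input split becomes one map task, and the shuffle then routes each key to exactly one reduce task.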

== Tutorials ==
 * CountLinks Counting outbound links with MapReduce
