From: Apache Wiki
To: nutch-commits@incubator.apache.org
Reply-To: nutch-dev@lucene.apache.org
Date: Tue, 13 Mar 2007 22:22:59 -0000
Subject: [Nutch Wiki] Update of "Getting Started" by SteveSeverance

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by SteveSeverance:
http://wiki.apache.org/nutch/Getting_Started

New page:
This page is a collection of information that is useful for new developers. Some of it will need to move to the Hadoop Wiki, but I am putting it here first as I assemble it. Please feel free to add, comment, and make corrections.

Steve

To new developers: If you want to begin developing on Nutch, do not forget to start by looking at the Hadoop source code. Hadoop is the platform that Nutch is implemented on. To understand how Nutch works, you also need to understand Hadoop.

=== What are the Hadoop primitives and how do I use them? Why are they there (what functionality do they add over regular primitives)? ===

These primitives implement the Hadoop Writable interface (or WritableComparable). This gives Hadoop control over the serialization of these objects. If you look at the higher-level Hadoop file system objects, such as ArrayFile, you will see that they implement the same interfaces for serialization. Using these primitive types allows serialization to be done in the same way as for higher-order data structures such as MapFile.

=== How does the Hadoop implementation of MapReduce work? ===

 1. First you need a JobConf. This class contains all the relevant information for the job. Information that you need to make sure you include in the JobConf includes:
 2. Then you need to submit your job to Hadoop to be run. This is done by calling JobClient.runJob. JobClient.runJob submits the job for starting and handles receiving status updates back from the job. It starts by creating an instance of JobClient. It then pushes the job toward execution by calling JobClient.submitJob.
 3. JobClient.submitJob handles splitting the input files and generating the MapReduce task.

== Tutorials ==

 * CountLinks Counting outbound links with MapReduce
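
To make the Writable contract from the primitives section above more concrete, here is a minimal, self-contained sketch. It does not use Hadoop itself: `SimpleWritable`, `IntBox`, and `WritableSketch` are hypothetical stand-ins invented for illustration. `SimpleWritable` mirrors the two methods of Hadoop's Writable interface (`write(DataOutput)` and `readFields(DataInput)`), and `IntBox` plays the role of a primitive wrapper. The point is only that the object, not the framework, decides how its bytes are written and read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in mirroring the two methods of Hadoop's Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical primitive wrapper; Comparable is included to echo the
// WritableComparable idea (serializable and usable as a sortable key).
class IntBox implements SimpleWritable, Comparable<IntBox> {
    private int value;

    IntBox() {}                       // no-arg constructor so an empty instance can be refilled
    IntBox(int value) { this.value = value; }

    int get() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);          // the object controls its own byte layout
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();         // must read exactly what write() wrote
    }

    @Override
    public int compareTo(IntBox other) {
        return Integer.compare(value, other.value);
    }
}

class WritableSketch {
    // Serialize any SimpleWritable to a byte array.
    static byte[] serialize(SimpleWritable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Refill an existing instance from previously serialized bytes.
    static void deserialize(SimpleWritable target, byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new IntBox(42));
        IntBox copy = new IntBox();
        deserialize(copy, bytes);
        System.out.println(copy.get() + " from " + bytes.length + " bytes"); // prints "42 from 4 bytes"
    }
}
```

Because every type goes through the same two methods, a container can serialize keys and values without knowing their concrete classes, which is the property the section above attributes to the Hadoop primitives and to structures such as MapFile.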