Date: Sat, 20 Oct 2007 17:59:50 -0700
Subject: Re: newbie seeking inputs and help
From: Ted Dunning
To: hadoop-user@lucene.apache.org

Look for the slide show on Nutch and Hadoop at

    http://wiki.apache.org/lucene-hadoop/HadoopPresentations

and open the one called "Scalable Computing with Hadoop (Doug Cutting,
May 2006)".


On 10/20/07 1:53 PM, "Jim the Standing Bear" wrote:

> Hi,
>
> I have been studying MapReduce and Hadoop for the past few weeks, and
> find them very new concepts. While I have a grasp of the MapReduce
> process and can follow some of the example code, I still feel at a
> loss when it comes to creating my own exercise "project", and would
> appreciate any input and help with that.
>
> The project I have in mind is to fetch several hundred HTML files
> from a website and use Hadoop to index the words of each page so they
> can be searched later. However, in all the examples I have seen so
> far, the data are loaded into HDFS prior to the execution of the job.
>
> Here is the set of questions I have:
>
> 1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need
> for this project?
>
> 2. If so, is there any detailed documentation, or are there examples,
> for these classes?
>
> 3. If not, could you please let me know conceptually how you would go
> about doing this?
>
> 4. If the data must be loaded beforehand, must I manually retrieve
> all the web pages and load them into HDFS, or do I list the URLs of
> the pages in a text file and split that file instead?
>
> As you can see, I am very confused at this point and would greatly
> appreciate all the help I could get. Thanks!
>
> -- Jim
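
To make question 4 concrete: the second option you describe is the usual
trick. Put the URLs, one per line, in a small text file, load that file
into HDFS, and make it the input of the job. Each map task then fetches
the pages named in its share of the lines. What follows is a minimal
sketch of such a job against the old org.apache.hadoop.mapred API: the
mapper fetches a page and emits (word, url) pairs, and the reducer
collects the URLs for each word into a crude inverted index. Every name
in it (UrlIndexer, FetchMapper, IndexReducer) is made up for
illustration, and the mapred interfaces shifted a bit between releases
of this era, so treat it as a sketch rather than code for any particular
Hadoop version.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UrlIndexer {

  // Input records are lines of the URL list: key = byte offset, value = URL.
  public static class FetchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String url = value.toString().trim();
      if (url.length() == 0) {
        return;
      }
      // Fetch the page. A real job would set timeouts, handle failures,
      // strip HTML tags, and be polite to the server.
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(url).openStream()));
      Text urlText = new Text(url);
      String line;
      while ((line = in.readLine()) != null) {
        // Crude tokenization on non-word characters.
        for (String word : line.split("\\W+")) {
          if (word.length() > 0) {
            output.collect(new Text(word.toLowerCase()), urlText);
          }
        }
      }
      in.close();
    }
  }

  // For each word, gather the distinct URLs it appeared in.
  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // The iterator reuses its Text object, so copy each value out.
      Set<String> unique = new HashSet<String>();
      while (urls.hasNext()) {
        unique.add(urls.next().toString());
      }
      StringBuilder postings = new StringBuilder();
      for (String u : unique) {
        if (postings.length() > 0) {
          postings.append(' ');
        }
        postings.append(u);
      }
      output.collect(word, new Text(postings.toString()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UrlIndexer.class);
    conf.setJobName("urlindexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(FetchMapper.class);
    conf.setReducerClass(IndexReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // URL list
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // index output
    JobClient.runJob(conf);
  }
}

Load the URL list and run it with something like (file names made up):

  bin/hadoop dfs -put urls.txt urls.txt
  bin/hadoop jar urlindexer.jar UrlIndexer urls.txt index-out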
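
If you would rather do the retrieval up front (the first option in your
question 4), you do not have to write any loading code either: fetch the
pages with wget or a similar tool, copy the directory into HDFS with the
dfs shell, and let the job read the page files instead of a URL list.
For example (paths made up):

  bin/hadoop dfs -put /tmp/pages pages

For a few hundred pages either way works. Fetching inside the map tasks
saves the manual download step, while pre-loading means the pages are
pulled from the website only once, no matter how often you rerun the
indexing job.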