hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: newbie seeking inputs and help
Date Sun, 21 Oct 2007 00:59:50 GMT

Look for the slide show on Nutch and Hadoop.

Open the one called "Scalable Computing with Hadoop (Doug Cutting, May

On 10/20/07 1:53 PM, "Jim the Standing Bear" <standingbear@gmail.com> wrote:

> Hi,
> I have been studying map reduce and hadoop for the past few weeks, and
> found it a very new concept.  While I have a grasp of the map reduce
> process as well as being able to follow some of the example code, I
> still feel at a loss when it comes to creating my own exercise
> "project" and would appreciate any inputs and help on that.
> The project I have in mind is to leech several (hundred) HTML
> files from a website, and use hadoop to index the words of each page
> so they can be later searched.  However, in all examples I have seen
> so far, the data are split into HDFS prior to the execution of the
> job.
> Here is the set of questions I have:
> 1. Is CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need
> for this project?
> 2. If so, is there detailed documentation, or are there examples, for
> these classes?
> 3. If not, could you please let me know conceptually how you would go
> about doing this?
> 4. If data must be split beforehand, must I manually retrieve all
> the webpages and load them into HDFS?  Or do I list the URLs of the
> webpages in a text file and split this file instead?
> As you can see, I am very confused at this point and would greatly
> appreciate all the help I could get.  Thanks!
> -- Jim
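
For the last question above (listing the URLs in a text file and splitting that file), here is a minimal sketch of how such a job might be shaped. It is plain Python rather than the Hadoop API: build_index only stands in for the framework's map, group-by-key, and reduce phases. The names map_url, reduce_word, build_index, and PAGES, along with the example.com URLs, are all made up for illustration. The map step takes one URL per input line, fetches the page, and emits (word, url) pairs; the reduce step collects the list of pages for each word.

```python
import re
from collections import defaultdict


def map_url(url, fetch):
    """Map step: fetch one page and emit a (word, url) pair per distinct word."""
    html = fetch(url)
    # Crude tag stripping; a real job would use a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", html)
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield word, url


def reduce_word(word, urls):
    """Reduce step: build the sorted, de-duplicated page list for one word."""
    return word, sorted(set(urls))


def build_index(url_lines, fetch):
    """Stand-in for the framework: run map, group by key, then run reduce."""
    grouped = defaultdict(list)
    for line in url_lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the URL list
        for word, page in map_url(url, fetch):
            grouped[word].append(page)
    return dict(reduce_word(word, pages) for word, pages in grouped.items())


# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "http://example.com/a.html": "<html><body>Hadoop runs map reduce</body></html>",
    "http://example.com/b.html": "<html><body>Nutch uses Hadoop</body></html>",
}

index = build_index(PAGES.keys(), PAGES.__getitem__)
print(index["hadoop"])
```

The fetch argument is passed in so that the map logic can be tried locally; to hit a real site it could be replaced by a small function built on urllib.request.urlopen. With this shape, only the small URL list needs to be loaded and split up front, and the pages themselves are retrieved inside the map tasks.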
