hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: some newby questions
Date Wed, 08 Nov 2006 15:41:13 GMT
I don't know if I completely understand what you are asking, but let me
try to answer your questions.

David Pollak wrote:
> Howdy,
> Is there a way to store "by-product" data someplace where it can be 
> read?  For example, as I'm iterating over a collection of documents, I 
> want to generate some statistics about the collection, put those stats 
> "someplace" that can be accessed during future map-reduce cycles.  
> Should I simply run a "faux" map-reduce cycle to count the information 
> and store it in a known location in the DFS?
Usually you would run one MapReduce job to compute and store the
intermediate results, and then another job to process them into
aggregated or final results.  Sometimes this can be done in a single
job, sometimes not.  Take a look at the Grep and WordCount examples
that ship with Hadoop for sample jobs.
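
To make the two-job idea concrete, here is a rough sketch in plain Java
with no Hadoop classes at all.  The first pass plays the role of the
statistics job and the second pass reads its result.  The class name,
the method names, and the choice of document frequency as the statistic
are all made up for illustration; in a real setup the first job would
write its output to a known DFS path and the second job would load it
from there.

```java
import java.util.*;

// Hypothetical illustration only: two passes over the same documents,
// standing in for two chained MapReduce jobs.
class TwoPassStats {

    // Pass 1: for each word, count the number of documents it occurs in.
    static Map<String, Integer> documentFrequency(List<String> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            // A set, so that a word repeated in one document counts once.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            for (String w : seen) {
                if (!w.isEmpty()) {
                    df.merge(w, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Pass 2: use the stored statistics, here to keep only the words of a
    // document that occur in at most maxDf documents of the collection.
    static List<String> rareWords(String doc, Map<String, Integer> df, int maxDf) {
        List<String> out = new ArrayList<>();
        for (String w : new TreeSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")))) {
            if (!w.isEmpty() && df.getOrDefault(w, 0) <= maxDf) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick fox", "the lazy dog", "the fox");
        Map<String, Integer> df = documentFrequency(docs);
        System.out.println(rareWords(docs.get(0), df, 1));
    }
}
```

Whether the two passes can be folded into one job depends on whether
the second pass needs the complete statistics before it starts, which
is the case in this sketch.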
> Is there a way to map a collection of words or documents to associated 
> numbers so that indexing could be based on the word number and/or 
> document number rather than actual word and actual URL?  Because the 
> reduce tasks take place in separate processes, it seems that there's 
> no way to coordinate the ordinal counting.
If you are talking about the index document id, then you would need to
read the index and map each url to its document id, and then a second
job would map the id to whatever else, keyed by url.  If you want to
count each word globally across all tasks and splits, you can
coordinate it within a split by using a MapRunner, but across splits I
don't know of a way to do that.
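
One possible workaround, offered as an assumption rather than a Hadoop
feature: collect the distinct words first, push them through a single
sorted pass (the way a job with exactly one reduce task would see
them), number them there, and let a later job look the numbers up.
Below is a plain-Java sketch of that idea; the class and method names
are hypothetical and nothing here uses the Hadoop API.

```java
import java.util.*;

// Hypothetical illustration only: build a word -> number dictionary in
// one sorted pass, then use it to encode documents in a second pass.
class WordDictionary {

    // Pass 1: gather the distinct words and number them in sorted order.
    static Map<String, Integer> assignIds(List<String> docs) {
        SortedSet<String> distinct = new TreeSet<>();
        for (String doc : docs) {
            for (String w : doc.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    distinct.add(w);
                }
            }
        }
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 0;
        for (String w : distinct) {
            ids.put(w, next++);
        }
        return ids;
    }

    // Pass 2: replace each known word of a document with its number.
    static List<Integer> encode(String doc, Map<String, Integer> ids) {
        List<Integer> out = new ArrayList<>();
        for (String w : doc.toLowerCase().split("\\s+")) {
            Integer id = ids.get(w);
            if (id != null) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("b a", "c a");
        Map<String, Integer> ids = assignIds(docs);
        System.out.println(ids);
        System.out.println(encode("c a", ids));
    }
}
```

The single numbering pass is a bottleneck by design, so this only makes
sense when the set of distinct words is small compared to the documents
themselves.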
> There's a MapFile construct that looks like it could be very useful 
> for my application, but there's no documentation for MapFile.  Does 
> anybody have pointers to documentation or example code?
MapFile is easy to use: just use MapFileOutputFormat as your output
format.  Your key becomes the Map key and your value becomes the Map
value.
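
In case it helps to picture what a MapFile gives you: it is a file of
key/value pairs kept in key order, plus a small index that lets a
lookup jump close to the right spot.  The following is only a
conceptual model in plain Java, not Hadoop's code; the class name, the
String keys, and the interval parameter are assumptions made for the
sketch.

```java
import java.util.*;

// Hypothetical in-memory model of a sorted key/value store with a
// sparse index. It does not read or write real MapFiles.
class SortedLookupSketch {
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    // Positions of every interval-th entry; this is the sparse index.
    private final List<Integer> indexPositions = new ArrayList<>();
    private final int interval;

    SortedLookupSketch(int interval) {
        this.interval = interval;
    }

    // Entries must arrive in key order, as sorted job output does.
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % interval == 0) {
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    // Binary-search the index for the last indexed key <= key, then scan
    // forward through at most one interval of entries.
    String get(String key) {
        int lo = 0, hi = indexPositions.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int pos = indexPositions.get(mid);
            if (keys.get(pos).compareTo(key) <= 0) {
                start = pos;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // key sorts before every stored key
        }
        for (int i = start; i < keys.size() && i < start + interval; i++) {
            int c = keys.get(i).compareTo(key);
            if (c == 0) return values.get(i);
            if (c > 0) break;
        }
        return null;
    }
}
```

The in-order requirement is why this fits reduce output naturally,
since the keys reaching a reduce task are already sorted.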

> Thanks,
> David
