hadoop-common-user mailing list archives

From David Pollak <...@athena.com>
Subject Re: some newby questions
Date Wed, 08 Nov 2006 17:40:41 GMT

On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:

> I don't know if I completely understand what you are asking but let  
> me try to answer your questions.
> David Pollak wrote:
>> Howdy,
>> Is there a way to store "by-product" data someplace where it can  
>> be read?  For example, as I'm iterating over a collection of  
>> documents, I want to generate some statistics about the  
>> collection, put those stats "someplace" that can be accessed  
>> during future map-reduce cycles.  Should I simply run a "faux" map- 
>> reduce cycle to count the information and store it in a known  
>> location in the DFS?
> Usually you would run a MapReduce job to store intermediate results  
> and then another job to process aggregated or final results.   
> Sometimes this can be done in a single job, sometimes not.  Take a  
> look at the Hadoop example for Grep or WordCount for example jobs.
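For what it's worth, the chaining Dennis describes can be sketched in plain Java, leaving the Hadoop API aside and showing just the shape of the two jobs: the first job's output (per-word counts) becomes the second job's input (an aggregate). The class and method names here are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A plain-Java sketch of two chained jobs: the first job's output
// (per-word counts) becomes the second job's input (a grand total).
public class ChainedJobs {

    // "Job 1": map each document to (word, 1) pairs, reduce to counts.
    public static Map<String, Integer> countWords(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String word : doc.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Job 2": read job 1's output and aggregate a single total.
    public static int totalCount(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }
}
```

In real Hadoop the two stages would be two separate job submissions, with job 2's input path pointed at job 1's output directory in the DFS.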

Yep.  I'm able to chain jobs together.  In one case, I am counting  
URLs and Noun Phrases (NPs) for documents retrieved during a certain  
run.  In order to normalize the URL and NP counts, I want to divide  
by the total number of URLs or NPs for that time period.  I seem to  
have two options:

1 - I can aggregate the counts during the Map/Reduce task that culls  
the URLs and NPs.
2 - I can run another Map/Reduce task on the URL and NP sets to count  
the number of documents.

It seems that if I do the latter, it's another full iteration over  
the data set, which seems expensive.  Is #2 the best choice?
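Option 1 amounts to keeping a running total while emitting the per-item counts, so the normalization divisor falls out of the same pass. A rough plain-Java sketch of that idea (names invented; in Hadoop this logic would live inside the map/reduce tasks themselves):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of option 1: accumulate the grand total during the same pass
// that culls the per-URL counts, then normalize by that total.
public class OnePassNormalize {

    // One pass over the records: per-key counts plus a running total.
    public static Map<String, Double> normalizedCounts(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String url : urls) {
            counts.merge(url, 1, Integer::sum);
            total++;  // the "by-product" aggregate, computed in the same pass
        }
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / (double) total);
        }
        return normalized;
    }
}
```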

>> Is there a way to map a collection of words or documents to  
>> associated numbers so that indexing could be based on the word  
>> number and/or document number rather than actual word and actual  
>> URL?  Because the reduce tasks take place in separate processes,  
>> it seems that there's no way to coordinate the ordinal counting.
> If you are talking about the index document id, then you would need  
> one job to read the index and map url to document id, and a second  
> job to map id to whatever else by url.  If you want to count each  
> word globally across all tasks and splits, you can coordinate the  
> count within a single split by using a MapRunner, but across splits  
> I don't know of a way to do that.

Yep... I'm looking to generate a unique ID across all the MR tasks.   
Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5

Is there a final merge task that merges all the reductions together?   
If so, perhaps I could do the count in that final merge.  Any idea if  
the final merge is accessible?
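One trick that may fit here (assuming the key set is small enough for one task): force a single reduce task, so every key flows through one reducer in sorted order, and assign ordinals with a plain counter there. The plain-Java equivalent of that sorted single-reducer pass, with invented names:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Sketch of ordinal assignment in a single sorted reduce: because one
// reducer sees every key in sorted order, a simple counter yields a
// globally unique, dense ID per word.
public class WordIds {

    public static Map<String, Integer> assignIds(Collection<String> words) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 1;
        // TreeSet dedupes and sorts, like the shuffle feeding one reducer.
        for (String word : new TreeSet<>(words)) {
            ids.put(word, next++);
        }
        return ids;
    }
}
```

Run over the example words, this yields exactly the file shown above: apple 1, beta 2, cat 3, dog 4, moose 5.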



> Dennis
>> Thanks,
>> David
