From stack <st...@duboce.net>
Subject Re: Map Reduce over HBase - sample code
Date Tue, 24 Jun 2008 14:57:02 GMT
Naama Kraus wrote:
> ..
> What if the mission was the following - for each course in the table,
> calculate the average grade in that course. In that case both map and reduce
> are required, is that correct ? Map will emit for each row a {course_name,
> grade} pair. Reduce will emit the average grades for each course
> {course_name, avg_grade}. Output can be put in a separate table (probably
> one holding courses information). Does this make sense ?
That'll work.
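
A rough sketch of that flow, in plain Python rather than the Hadoop or
hbase APIs. The sample rows and the names map_row, shuffle and
reduce_course are made up for illustration; shuffle only stands in for
the sort/group step the MR framework does between map and reduce.

```python
from collections import defaultdict

# Hypothetical input rows: (student row key, course name, grade).
rows = [
    ("student1", "algebra", 90),
    ("student2", "algebra", 70),
    ("student1", "biology", 85),
    ("student3", "biology", 95),
    ("student2", "chemistry", 60),
]

def map_row(row):
    """Map: emit one {course_name, grade} pair per input row."""
    _, course, grade = row
    yield course, grade

def shuffle(pairs):
    """Stand-in for the framework: group values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_course(course, grades):
    """Reduce: emit {course_name, avg_grade} for one course."""
    return course, sum(grades) / len(grades)

pairs = [pair for row in rows for pair in map_row(row)]
averages = dict(reduce_course(course, grades) for course, grades in shuffle(pairs))
print(averages)
```

In a real job the reduce output would be written to the courses table
instead of being collected in a dict.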

>> * At a higher level, I'd suggest a refactoring.  Do all of your work in
>> the map phase.  Have no reduce phase.  I suggest this because as is, all
>> rows emitted by the map are being sorted by the MR framework.  But hbase
>> will also do a sort on insert.  Avoid paying the price of the MR sort.  Do
>> your calculation in the map and then insert the result at map time.   Either
>> emit nothing or, emit a '1' for every row processed so the MR counters tell
>> a story about your MR job.*
> That's an interesting point. So if both map and reduce are required, then
> two sorts must take place. Is that correct ?
Yes, but in your new example the two sorts serve different ends: the
first collects all of a course's data together for the reduce, and the
second orders courses lexicographically in hbase (presuming course is
the primary key).
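
To illustrate the second sort only: a small Python sketch, not hbase
code, in which a dict keyed by bytes stands in for the table and
sorted() stands in for the ordering hbase keeps on row keys. The course
names are hypothetical. Whatever order the reduce emits in, a scan
returns rows in byte-wise key order.

```python
# Reduce output in arbitrary arrival order (hypothetical course names).
reduce_output = [("course2", 71.5), ("course10", 88.0), ("course1", 64.0)]

# Stand-in for the table: row key bytes -> average grade.
table = {}
for course, avg in reduce_output:
    table[course.encode("utf-8")] = avg

# A scan walks the keys in lexicographic (byte) order, not numeric order.
scan_order = [key.decode("utf-8") for key in sorted(table)]
print(scan_order)
```

Note that course10 lands before course2; zero-pad numeric parts of the
key if numeric order matters.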


