hadoop-general mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: setNumReduceTasks(1)
Date Thu, 28 Jan 2010 07:09:18 GMT

A possible solution is to emit at most N rows from each mapper and then use 
a single reduce task [*] - provided the value of N is not very high.
That way the reducer sees at most m * N rows (for m mappers) instead of the 
full input set, so the limit is easy to apply.
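
A minimal sketch of that idea in plain Python (not the Hadoop API - the mapper/reducer functions and the value of N here are hypothetical stand-ins): each mapper emits at most N rows, and the single reducer applies the final limit over at most m * N incoming rows.

```python
# Simulation of the map-side limit: each mapper truncates its own
# output to N rows, so the lone reducer never sees more than m * N rows.
from itertools import islice

N = 5  # desired row limit (hypothetical value)

def mapper(rows, n):
    """Emit at most n rows from this mapper's input split."""
    return list(islice(rows, n))

def single_reducer(all_mapper_outputs, n):
    """The single reduce task sees at most m * n rows; keep the first n."""
    merged = [row for output in all_mapper_outputs for row in output]
    return merged[:n]

# Three mappers with input splits of varying sizes; note the second
# split has fewer than N rows, which is fine for this scheme.
splits = [range(100), range(3), range(50)]
mapper_outputs = [mapper(split, N) for split in splits]
result = single_reducer(mapper_outputs, N)
assert len(result) == N
```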

If you're OK with some variance in the number of rows inserted (and 
if the value of N is very high), you can do more interesting things, like 
emitting N/m' rows per mapper and using multiple reducers (r), on the 
assumption that each reducer will see at least N/r rows - so you can limit 
to N/r per reducer. Of course, this introduces a possible error ...
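
A sketch of that multi-reducer variant, again in plain Python rather than the Hadoop API (N, r, and the partitioning function are hypothetical): rows are hash-partitioned across r reducers and each reducer keeps at most N/r rows. If some partition receives fewer than N/r rows, the total falls short of N - the possible error mentioned above.

```python
# Simulation of the parallel limit: r reducers each apply LIMIT N // r.
N = 12  # desired total limit (hypothetical)
r = 3   # number of reducers (hypothetical)
per_reducer = N // r

def partition(rows, num_reducers):
    """Distribute rows across num_reducers buckets by hash of the row."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(row) % num_reducers].append(row)
    return buckets

def limited_reduce(bucket, limit):
    """Each reducer independently keeps at most `limit` rows."""
    return bucket[:limit]

rows = list(range(100))
kept = [limited_reduce(b, per_reducer) for b in partition(rows, r)]
total = sum(len(k) for k in kept)
# With plenty of rows per partition the total hits N exactly; with a
# skewed or sparse partition it can come up short.
assert total <= N
```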


[*] Assuming you just want a simple limit - nothing else.
Also note that each mapper might want to emit N rows rather than 'tweaks' 
like N/m rows, since it is possible that some mappers have fewer than 
N/m rows to emit to begin with!

Something Something wrote:
> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class
> be instantiated on only one machine... always?  I mean if I have a cluster of
> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to
> be instantiated only on 1 machine?
> If the answer is yes, then I will use a static variable as a counter to see how
> many rows have been added to my HBase table so far.  In my use case, I want
> to write only N number of rows to a table.  Is there a better way to do
> this?  Please let me know.  Thanks.
