A possible solution is to emit only N rows from each mapper and then use
1 reduce task [*] if the value of N is not very high.
That way you end up with at most m * N rows at the reducer (m being the
number of mappers) instead of the full input set, so the limit is much
easier to enforce.
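
As a rough sketch of the idea (plain Java with no Hadoop dependencies; the class LimitSketch and its methods are made-up names for illustration, and in a real job the same checks would sit inside your Mapper and Reducer with setNumReduceTasks(1)):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the idea; LimitSketch and its methods are
// hypothetical names for illustration, not Hadoop or HBase APIs.
class LimitSketch {
    // Map side: a mapper forwards at most n rows from its input split.
    static List<String> mapSide(List<String> split, int n) {
        return new ArrayList<>(split.subList(0, Math.min(n, split.size())));
    }

    // Reduce side: the single reducer receives at most m * n rows
    // (m = number of mappers) and writes only the first n of them.
    static List<String> reduceSide(List<List<String>> mapOutputs, int n) {
        List<String> written = new ArrayList<>();
        for (List<String> output : mapOutputs) {
            for (String row : output) {
                if (written.size() >= n) {
                    return written;
                }
                written.add(row);
            }
        }
        return written;
    }
}
```

Because only one reducer exists, the count lives in a single place and no cross-machine coordination is needed.
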
If you are OK with some variance in the number of rows inserted (and
if the value of N is very high), you can do more interesting things, like
emitting N/m rows per mapper and using multiple reducers (r), with the
assumption that each reducer will see at least N/r rows, so you can limit
to N/r rows per reducer. Of course, this introduces a possible error: the
total written can differ from N if the assumption does not hold ...
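
To make the possible error concrete, here is a small arithmetic sketch (plain Java; ReducerCapSketch is a made-up name, and rounding N/r up is my own assumption, you could just as well round down):

```java
// Plain-Java arithmetic sketch; ReducerCapSketch is a hypothetical name
// for illustration, not a Hadoop or HBase API.
class ReducerCapSketch {
    // Per-reducer cap. Assumption: round N/r up so the caps sum to at least n.
    static int capPerReducer(int n, int r) {
        return (n + r - 1) / r;
    }

    // Total rows written when reducer i receives rowsPerReducer[i] rows
    // and each reducer stops once it reaches the cap.
    static int totalWritten(int[] rowsPerReducer, int n) {
        int cap = capPerReducer(n, rowsPerReducer.length);
        int total = 0;
        for (int rows : rowsPerReducer) {
            total += Math.min(rows, cap);
        }
        return total;
    }
}
```

With N = 10 and 4 reducers the cap is 3. If every reducer receives plenty of rows, 12 rows are written (slightly over N); if one reducer receives 100 rows and the others 1 each, only 6 rows are written (under N).
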
Regards,
Mridul
[*] Assuming you just want a simple limit and nothing else.
Also note that each mapper might want to emit N rows rather than a 'tweak'
like N/m rows, since it is possible that multiple mappers have
fewer than N/m rows to emit to begin with!
Something Something wrote:
> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class
> be instantiated only on one machine.. always? I mean if I have a cluster of
> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to
> be instantiated only on 1 machine?
>
> If the answer is yes, then I will use a static variable as a counter to see
> how many rows have been added to my HBase table so far. In my use case, I
> want to write only N rows to a table. Is there a better way to do
> this? Please let me know. Thanks.
