hadoop-general mailing list archives

From Alex Baranov <alex.barano...@gmail.com>
Subject Re: setNumReduceTasks(1)
Date Thu, 28 Jan 2010 06:04:24 GMT
Since the MapReduce programming model defines only one "communication" point
within a job - the shuffle that occurs after all Map tasks are done and before
the Reduce tasks begin - I believe any solution to your problem will come at
the price of lower performance.

That said, with 10 workers I don't think you should use the
"setNumReduceTasks(1)" tactic, since performance will degrade badly. Of course
this depends a lot on N: if N is quite small and everything your reduce task
outputs will fit on one datanode, then *maybe* your strategy can be considered.
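
To illustrate, here is a rough, untested sketch of what that single-reducer
cap might look like with the org.apache.hadoop.mapreduce API; the key/value
types and MAX_ROWS are placeholders for whatever your job actually uses:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class CappedReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private static final long MAX_ROWS = 1000;  // hypothetical N
    private long written = 0;  // one reducer task => one instance, so a plain
                               // field is enough; no static needed

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      if (written >= MAX_ROWS) {
        return;  // already emitted N rows, ignore the rest
      }
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
      written++;
    }
  }

  // and in the driver:
  //   job.setNumReduceTasks(1);  // forces everything through one reducer task

Note the single reducer still has to receive and then skip all the remaining
input, which is exactly the performance price mentioned above.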

If N is really big and there is a lot of work left to do before the Reducers
should stop, then I'd consider communicating through progress information
stored in the DFS (the implementation will not be straightforward, though,
since we don't want to hurt performance too much).
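
For that DFS-based signalling, a very rough sketch (untested; the flag path is
made up) could be a small helper that each reducer polls every few thousand
records rather than on every call, to keep the namenode traffic low:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ProgressFlag {
    // hypothetical location agreed on by all tasks of the job
    private static final Path DONE_FLAG = new Path("/tmp/myjob/enough-rows");

    // called by the task that writes the N-th row
    public static void markDone(Configuration conf) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      fs.create(DONE_FLAG, true).close();  // an empty marker file is enough
    }

    // checked periodically by the other tasks
    public static boolean isDone(Configuration conf) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      return fs.exists(DONE_FLAG);
    }
  }

This only gives an approximate cap, of course, since several tasks can pass
the check before the flag appears.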

Alex.

On Tue, Jan 26, 2010 at 1:22 AM, Something Something <
mailinglists19@gmail.com> wrote:

> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
> class be instantiated on only one machine... always?  I mean, if I have a
> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
> guaranteed to be instantiated on only 1 machine?
>
> If the answer is yes, then I will use a static variable as a counter to see
> how many rows have been added to my HBase table so far.  In my use case, I
> want to write only N rows to a table.  Is there a better way to do this?
> Please let me know.  Thanks.
>
