mahout-dev mailing list archives

From Sean Owen <>
Subject Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs
Date Fri, 11 Jun 2010 17:13:39 GMT
It's the same question as --input versus -Dmapred.input.dir. The latter
is the standard Hadoop parameter, which we have to support, if only
because the user may already be configuring it in the XML configs, but
also because it will be familiar to Hadoop users, I assume.
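To make that equivalence concrete, here is a minimal, hypothetical sketch
(not actual Mahout code, and the precedence shown -- the custom flag winning
over the -D / XML value -- is an assumption) of how a driver might honor both
its own --input flag and the generic Hadoop property:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not actual Mahout code: a driver that accepts both
// its own --input flag and the generic Hadoop property. The precedence
// (explicit flag wins over -D / XML configuration) is an assumption here.
public class InputFlagSketch {

    static String resolveInput(String inputFlag, Map<String, String> hadoopConf) {
        if (inputFlag != null) {
            return inputFlag;                      // explicit --input wins
        }
        return hadoopConf.get("mapred.input.dir"); // fall back to -D / XML config
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.input.dir", "/data/from-xml");

        System.out.println(resolveInput("/data/from-flag", conf)); // /data/from-flag
        System.out.println(resolveInput(null, conf));              // /data/from-xml
    }
}
```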

Jobs can read and change these settings to implement additional
restrictions, sure. For example, the user-supplied input and output
dirs only control the input of the first M/R in a chain of M/Rs run
by a job, and the output of its final M/R. In between, the job
overrides these values on individual M/Rs as needed, to direct
intermediate output elsewhere.
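As a concrete illustration of that chaining, here is a hypothetical sketch
(not actual Mahout code; the property names match Hadoop's, but the driver
structure and the choice of intermediate directory are assumptions) of a
driver planning a two-step chain, where only the first step reads the user's
input and only the last step writes to the user's output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch, not actual Mahout code: a driver plans a two-step
// M/R chain. The user-supplied input feeds only the first step, and the
// user-supplied output receives only the last step's results; the
// intermediate directory is chosen internally (assumed naming here).
public class ChainSketch {

    static List<Properties> planChain(String userInput, String userOutput) {
        String intermediate = userOutput + "-tmp/step1"; // internal, assumed

        Properties step1 = new Properties();
        step1.setProperty("mapred.input.dir", userInput);      // user's input
        step1.setProperty("mapred.output.dir", intermediate);  // overridden internally

        Properties step2 = new Properties();
        step2.setProperty("mapred.input.dir", intermediate);   // reads intermediate
        step2.setProperty("mapred.output.dir", userOutput);    // user's output

        List<Properties> chain = new ArrayList<>();
        chain.add(step1);
        chain.add(step2);
        return chain;
    }

    public static void main(String[] args) {
        for (Properties step : planChain("/data/in", "/data/out")) {
            System.out.println(step.getProperty("mapred.input.dir")
                + " -> " + step.getProperty("mapred.output.dir"));
        }
    }
}
```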

So the question is not whether we need our own way to control Hadoop
parameters at times -- we very much do, and this already happens and
works internally. The question is merely one of command-line "UI":
whether to duplicate Hadoop flags with our own.

I personally am inclined not to do this, as it's just more code, more
possibilities to support and debug, and more divergence from the norm.
However, in the case of input and output, I think we all agreed that
such a basic flag might as well have its own custom version that works
the same way as the Hadoop one.

I'd argue we shouldn't do the same for the number of mappers and
reducers. From there, why not duplicate the ten other flags I can
think of -- compressing map output, compressing reducer output, I/O
sort buffer size, etc., etc.?

On Fri, Jun 11, 2010 at 6:01 PM, Jeff Eastman
<> wrote:
> Over to dev list:
> Sean, we currently have some jobs which accept numbers of mappers and
> reducers as optional command arguments and others that require the -D
> arguments to control the same, as you have written. It seems our usability
> would improve if we adopted a consistent policy across all Mahout
> components. If so, would you argue that all use -D arguments for this
> control? What about situations where our default is not whatever Hadoop does
> by default? Would this result in noticeable behavior changes? Also, some
> algorithms don't work with arbitrary numbers of reducers and some don't use
> reducers at all. What would you suggest?
