mahout-user mailing list archives

From Jake Mannix <>
Subject Re: Goals for Mahout 0.7
Date Mon, 13 Feb 2012 19:11:00 GMT
Hi John,

  This is some very good feedback, and warrants serious discussion.  In
light of this, I'm going to respond on the fly with some thoughts in this vein.

  We use Mahout at Twitter (the LDA stuff recently put in, and
in various places, among other things) in production, and we use it,
via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
script.  It's invoked in an environment where all of the parameters passed
in are kept in various (revision-controlled) config files, the inputs come
from a series of Pig jobs which are invoked in similar ways, and the outputs
on HDFS are loaded by various and sundry processes in their own ways.

  So in general, I totally agree with you that having production *java*
apps call into main() methods of other classes is extremely ugly and
error-prone.  So how would it look to interact via a nice java API to a
system which was to launch some (possibly iterative series of) MapReduce
jobs?

  I guess I can see how this would go: DistributedLanczosSolver, for example,
can be run without the main() method:

public int run(Path inputPath,
                 Path outputPath,
                 Path outputTmpPath,
                 Path workingDirPath,
                 int numRows,
                 int numCols,
                 boolean isSymmetric,
                 int desiredRank)

is something you could run right after instantiating a
DistributedLanczosSolver and .setConf()'ing it.

So is that the kind of thing we'd want more of?  Or are you thinking of
something nicer, where instead of just a response code, you get handles on
java objects which are pointing to the output data sets in some way?  I
suppose it's not terribly hard to just do

  DistributedRowMatrix outputData =
     new DistributedRowMatrix(outputPath, myTmpPath, numRows, numCols);

after running another job, but maybe it would be even nicer to return a
thing which has all the relevant output data as java objects.
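
To make that idea a bit more concrete, here is a minimal sketch of such a
return value. JobResult, its methods, and the named-output map are all made
up for illustration (nothing like this is confirmed to exist in Mahout), and
output locations are plain strings rather than Hadoop Paths so the snippet
stands alone.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: JobResult is NOT an existing Mahout class.
// It pairs the driver's exit code with named handles on the job's outputs.
final class JobResult {
    private final int exitCode;
    private final Map<String, String> outputs;

    JobResult(int exitCode, Map<String, String> outputs) {
        this.exitCode = exitCode;
        // Defensive copy so callers cannot mutate the result afterwards.
        this.outputs = Collections.unmodifiableMap(new LinkedHashMap<>(outputs));
    }

    int exitCode() {
        return exitCode;
    }

    boolean succeeded() {
        return exitCode == 0;
    }

    // Returns the location registered under the given output name.
    String output(String name) {
        String location = outputs.get(name);
        if (location == null) {
            throw new IllegalArgumentException("No output named " + name);
        }
        return location;
    }
}
```

A caller could then check succeeded() and look up an output by name instead
of re-deriving paths from the arguments it passed in.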

Another thing would be making sure that running these classes didn't require
such long method argument lists - builders to the rescue!
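
A rough sketch of what a builder for the run() arguments above might look
like follows. LanczosJobConfig and its method names are hypothetical and only
mirror that argument list; paths are plain strings here so the snippet
compiles without Hadoop, whereas a real version would presumably take Path
objects.

```java
// Hypothetical sketch: LanczosJobConfig is NOT a Mahout class.
// It collects the eight run() arguments behind named builder methods.
final class LanczosJobConfig {
    final String inputPath;
    final String outputPath;
    final String outputTmpPath;
    final String workingDirPath;
    final int numRows;
    final int numCols;
    final boolean symmetric;
    final int desiredRank;

    private LanczosJobConfig(Builder b) {
        this.inputPath = b.inputPath;
        this.outputPath = b.outputPath;
        this.outputTmpPath = b.outputTmpPath;
        this.workingDirPath = b.workingDirPath;
        this.numRows = b.numRows;
        this.numCols = b.numCols;
        this.symmetric = b.symmetric;
        this.desiredRank = b.desiredRank;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String inputPath;
        private String outputPath;
        private String outputTmpPath;
        private String workingDirPath;
        private int numRows;
        private int numCols;
        private boolean symmetric; // defaults to false
        private int desiredRank;

        private Builder() {}

        Builder input(String path) { inputPath = path; return this; }
        Builder output(String path) { outputPath = path; return this; }
        Builder tmp(String path) { outputTmpPath = path; return this; }
        Builder workingDir(String path) { workingDirPath = path; return this; }
        Builder dimensions(int rows, int cols) { numRows = rows; numCols = cols; return this; }
        Builder symmetric(boolean value) { symmetric = value; return this; }
        Builder desiredRank(int rank) { desiredRank = rank; return this; }

        // Validates the required settings before handing back an immutable config.
        LanczosJobConfig build() {
            if (inputPath == null || outputPath == null) {
                throw new IllegalStateException("input and output paths are required");
            }
            if (numRows <= 0 || numCols <= 0 || desiredRank <= 0) {
                throw new IllegalStateException("numRows, numCols and desiredRank must be positive");
            }
            return new LanczosJobConfig(this);
        }
    }
}
```

The call site then names every argument, e.g.
LanczosJobConfig.builder().input("/in").output("/out").dimensions(100, 50)
.desiredRank(10).build(), and a missing required setting fails at build()
rather than somewhere inside the job.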


On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <> wrote:

> From my perspective, I'd really like to see the Mahout API migrate away
> from the command-line-centric design it currently uses and move
> towards a library-centric API design.  I think this would go a long way in
> getting Mahout adopted into real-life commercial applications.
>
> While there might be a few algorithm drivers that you interact with by
> creating an instance of a class, and calling some method(s) on the instance
> to interact with it (I haven't actually seen one like that, but there might
> be a few), many algorithms are invoked by calling some static function on a
> class that takes ~37 typed arguments.  But what's worse, many drivers are
> invoked by having to create a String array with ~37 arguments as string
> values, and calling the static main function on the class.
>
> Now I'm not saying that having a static main function available to invoke
> an algorithm from the command line isn't useful.  It is, when you're testing
> an algorithm.  But once you want to integrate the algorithm into a
> commercial workflow it kind of sucks.
>
> For example, imagine if the API for invoking Math.max were designed the way
> many of the Mahout algorithms currently are.  You'd have something like
> this:
>
> String[] args = new String[3];
> args[0] = "max";
> args[1] = "7";
> args[2] = "4";
> int max = Math.main(args);
>
> It makes your code a horrible mess and very hard to maintain, as well as
> very prone to bugs.
>
> When I see a bunch of static main functions as the only way to interact
> with a library, no matter what the quality of the library is, my initial
> impression is that this has to be some minimally supported effort by a few
> PhD candidates still in academia, who will drop the project as soon as they
> graduate.  And while this might not be the case, it is one of the first
> impressions it gives, and can lead a company to drop the library from
> consideration before they do any due diligence into its quality and
> utility.
>
> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> API re-design will become more and more necessary, especially if you want a
> higher Mahout integration rate into commercial applications and workflows.
>
> Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
> its capabilities.  I really like that there is a well thought out class
> library of primitives for designing new serial and distributed machine
> learning algorithms.  And I think it has a high utility for integration
> into highly visible commercial projects.  But its high-level public API
> really is a barrier to entry when trying to design commercial applications.
>
> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman <> wrote:
> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
> > classification steps out of all of the driver classes (MAHOUT-930) and into
> > a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable
> > outlier removal capability to this job; and MAHOUT-933 is aimed at
> > factoring all the iteration mechanics from each driver class into the
> > ClusterIterator, which uses a ClusterClassifier which is itself an
> > OnlineLearner. This will hopefully allow semi-supervised classifier
> > applications to be constructed by feeding cluster-derived models into the
> > classification process. Still kind of fuzzy at this point but promising too.
> >
> > On 2/11/12 2:29 PM, Frank Scholten wrote:
> >
> >> ...
> >>
> >> What kind of clustering refactoring do you mean here? I did some work on
> >> creating bean configurations in the past (MAHOUT-612). I underestimated the
> >> amount of work required to do the entire refactoring. If this can be
> >> contributed and committed on a per-job basis I would like to help out.
> >>
> >>> ...
> >>>
> >>
> >>
> >
> --
> Thanks,
> John C
