mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Goals for Mahout 0.7
Date Wed, 22 Feb 2012 18:44:59 GMT
On Wed, Feb 22, 2012 at 10:00 AM, John Conwell <john@iamjohn.me> wrote:

> I've been meaning to respond with my thoughts to this (though it took me
> almost two weeks to get around to it).
>
> Jake, your example of the DistributedLanczosSolver as a way to interact
> with the different algorithms is along the lines of what I was thinking, at
> least as a bare minimum.  I'm a huge fan of using Builder classes for these
> kinds of scenarios, but I do understand that they are a pain to write, so
> in the short term, making all the algorithms API-friendly by just having
> run functions with typed arguments is fine.  Anything to get rid of the
> String[] args variables I'm creating and passing around.
>
> You also mention the output of the algorithm APIs.  I'm not a big fan of
> the returned 1 or 0 response codes.  Seeing that sends me into COM hResult
> PTSD-induced panic attacks (NOTE: I'm not making light of PTSD).  Except
> it's worse than hResults, because at least there were multiple hResult
> values that theoretically I could look up to figure out the actual problem
> that occurred.
>
> If I had my way, I would want the API output to return me two things:
> handles/objects that point to all the generated output of the algorithm
> (like you mentioned), and an object that gives me all the information I
> need to track the Hadoop mapreduce jobs that were invoked by the API call.
>
> The first one is a nice-to-have.  Since I most likely pass in a Path object
> to where I want the output to go, I know where the output is, and I should
> be able to infer what type of data it is, and so forth.  Having output
> handles to this data would be really nice, and would make integrating
> Mahout into larger workflows much easier, but it's not a show stopper.
>
> But the second one is VERY important and can be a show stopper.  Any large
> workflow that uses Hadoop somewhere in its API stack needs two things.
> First, any call to Hadoop needs to expose to the caller some kind of
> handle / identifier for the Hadoop job that was launched.  This is because
> the caller should be able to monitor the Hadoop job, provide status and
> feedback to the users, troubleshoot, etc., as with any long-running
> process.  And if the Mahout API call invokes multiple Hadoop jobs in a
> row, as is often the case in Mahout, the caller needs to be able to gain
> access to each of the Hadoop job ids as they become available.  The second
> thing is that any blocking, long-running API call needs to expose the
> option to run the call asynchronously (and provide Hadoop job ids as the
> Hadoop jobs get invoked).
>
> Take, for example, the LDA algorithm.  It's not unreasonable to say that
> calling LDADriver.run() could start a chain of N mapreduce jobs that could
> take 8 hours to complete, given a large enough corpus of documents and a
> large enough number of iterations.  In trying to integrate this into a
> workflow application, I have to design my app knowing that every time it
> calls LDADriver.run() it could potentially block the process for several
> hours to several days, with no way to inspect the progress of what is
> happening.  The core problems are: my app has no idea how long it's going
> to block, how far along the blocked process is, whether any of the
> mapreduce jobs failed, and if they did fail, which mapreduce jobs are
> associated with which call to LDADriver.run().
>
> But if all algorithm API calls allowed me to invoke them asynchronously,
> and provided me with an object that I could use to track what is going on
> in Hadoop, such as a realtime-updated list of job ids (an eventing
> mechanism for when new job ids are added would be nice, but not a must),
> it would go a long way toward lowering the barrier to entry for
> integrating Mahout into commercial applications.
>

+1  I like this idea: synchronously return a handle to a MahoutStatus
object, which you can poll for current status, current paths to output
data, and even handles to intermediate state (and eventually final state).
That would be awesome.  I like this, it's totally pro-style, unlike what we
have now.
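To make that concrete, here's a minimal sketch of what such a pollable
handle could look like.  Note that none of this exists in Mahout today:
MahoutStatus, runAsync(), and the fake job ids are all invented for
illustration; a real version would pull ids from Hadoop's running jobs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class MahoutStatusSketch {

  // A minimal status handle: collects Hadoop job ids as they launch,
  // and completes when the whole chain of jobs finishes.
  static class MahoutStatus {
    private final List<String> jobIds =
        Collections.synchronizedList(new ArrayList<String>());
    private final CompletableFuture<String> finalOutput =
        new CompletableFuture<String>();

    void addJobId(String id) { jobIds.add(id); }
    List<String> jobIds() { return new ArrayList<String>(jobIds); }
    boolean isDone() { return finalOutput.isDone(); }
    void complete(String outputPath) { finalOutput.complete(outputPath); }
    // Blocks only if the caller chooses to wait for the final output.
    String outputPath() throws Exception { return finalOutput.get(); }
  }

  // Stand-in for a hypothetical LDADriver.runAsync(...): returns the
  // handle immediately while the job chain runs in the background.
  static MahoutStatus runAsync(int iterations) {
    final MahoutStatus status = new MahoutStatus();
    new Thread(new Runnable() {
      public void run() {
        for (int i = 0; i < iterations; i++) {
          // In real life these would come from the launched Hadoop jobs.
          status.addJobId("job_201202_000" + i);
        }
        status.complete("/output/lda");
      }
    }).start();
    return status;
  }

  public static void main(String[] args) throws Exception {
    MahoutStatus status = runAsync(3);           // returns immediately
    String out = status.outputPath();            // block only when we want to
    System.out.println(status.jobIds().size());  // 3
    System.out.println(out);                     // /output/lda
  }
}
```

The caller can poll jobIds() for monitoring, or call outputPath() when it
actually needs to block on the result.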


> One last thing: I'd like to see Mahout get away from using static
> functions so much.  I don't really have a non-religious reason for this,
> other than to say that I find that when people use APIs that are very
> static-function heavy, they tend to write their own code in the same way,
> and you end up with 1000-line monolithic functions being invoked from
> main() functions, which is never a good thing.
>

Agreed, big-time.  Static functions actually *are* the devil, for the most
part.  I actually do subscribe to that religion, but I haven't been to
church in a long time.  Mea culpa?
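For what it's worth, here's a minimal sketch of what replacing a static
entry point with a builder could look like.  LanczosJob, its Builder, and
all the setter names are invented for illustration, not actual Mahout
classes:

```java
public class LanczosBuilderSketch {

  // Hypothetical job object built from named, typed setters instead of
  // a run(Path, Path, Path, Path, int, int, boolean, int) call.
  static class LanczosJob {
    final String inputPath;
    final int desiredRank;
    final boolean symmetric;

    private LanczosJob(Builder b) {
      this.inputPath = b.inputPath;
      this.desiredRank = b.desiredRank;
      this.symmetric = b.symmetric;
    }

    static class Builder {
      private String inputPath;
      private int desiredRank = 10;      // sane defaults, overridable
      private boolean symmetric = false;

      Builder input(String path) { this.inputPath = path; return this; }
      Builder rank(int r) { this.desiredRank = r; return this; }
      Builder symmetric(boolean s) { this.symmetric = s; return this; }

      LanczosJob build() {
        // Required arguments are checked once, in one place.
        if (inputPath == null) {
          throw new IllegalStateException("input is required");
        }
        return new LanczosJob(this);
      }
    }
  }

  public static void main(String[] args) {
    LanczosJob job = new LanczosJob.Builder()
        .input("/data/matrix")
        .rank(50)
        .symmetric(true)
        .build();
    System.out.println(job.desiredRank);  // 50
  }
}
```

Every argument is named at the call site, defaults are explicit, and
required-argument validation happens in build() rather than scattered
across 37 positional parameters.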


> Is that too much to ask?  :)
>

Not at all.
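And one more sketch while I'm at it: returning typed handles to the
generated output instead of an int response code.  DistributedRowMatrix is
a real Mahout class, but the stripped-down stand-in here, LanczosResult,
and this run() signature are all invented for illustration:

```java
public class ResultStructSketch {

  // Stripped-down stand-in for Mahout's DistributedRowMatrix: a handle
  // to a matrix stored on HDFS, not the data itself.
  static class DistributedRowMatrix {
    final String rowPath;
    final int numRows;
    final int numCols;

    DistributedRowMatrix(String rowPath, int numRows, int numCols) {
      this.rowPath = rowPath;
      this.numRows = numRows;
      this.numCols = numCols;
    }
  }

  // The struct-like thing a solver could return instead of 0/1: handles
  // to everything the job chain produced.
  static class LanczosResult {
    final DistributedRowMatrix eigenvectors;
    final double[] eigenvalues;

    LanczosResult(DistributedRowMatrix eigenvectors, double[] eigenvalues) {
      this.eigenvectors = eigenvectors;
      this.eigenvalues = eigenvalues;
    }
  }

  static LanczosResult run(String outputPath, int rank, int numCols) {
    // ... the MapReduce jobs would actually run here ...
    return new LanczosResult(
        new DistributedRowMatrix(outputPath + "/rawEigenvectors", rank, numCols),
        new double[rank]);
  }

  public static void main(String[] args) {
    LanczosResult r = run("/tmp/svd", 50, 1000);
    System.out.println(r.eigenvectors.rowPath);  // /tmp/svd/rawEigenvectors
    System.out.println(r.eigenvalues.length);    // 50
  }
}
```

The caller gets objects it can feed straight into the next stage of a
workflow, rather than re-deriving paths and dimensions on its own.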

  -jake


>
> On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > Hi John,
> >
> >  This is some very good feedback, and warrants serious discussion.  In
> > spite of this, I'm going to respond on the fly with some thoughts in
> > this vein.
> >
> >  We use Mahout at Twitter (the LDA stuff recently put in, and
> > mahout-collections
> > in various places, among other things) in production, and we use it,
> > actually,
> > via command-line invocations of the $MAHOUT_HOME/bin/mahout shell
> > script.  It's invoked in an environment where we keep all of the
> > parameters passed in in various (revision-controlled) config files, and
> > the inputs are produced from a series of Pig jobs which are invoked in
> > similar ways, and the outputs on HDFS are loaded by various and sundry
> > processes in their own ways.
> >
> >  So in general, I totally agree with you that having production *java*
> > apps call
> > into main() methods of other classes is extremely ugly and error-prone.
> > So
> > how would it look to interact via a nice java API to a system which was
> > going
> > to launch some (possibly iterative series of) MapReduce jobs?
> >
> >  I guess I can see how this would go: DistributedLanczosSolver, for
> example
> > can be run without the main() method:
> >
> > public int run(Path inputPath,
> >                 Path outputPath,
> >                 Path outputTmpPath,
> >                 Path workingDirPath,
> >                 int numRows,
> >                 int numCols,
> >                 boolean isSymmetric,
> >                 int desiredRank)
> >
> > is something you could run right after instantiating a
> > DistributedLanczosSolver and
> > .setConf()'ing it.
> >
> > So is that the kind of thing we'd want more of?  Or are you thinking of
> > something
> > nicer, where instead of just a response code, you get handles on java
> > objects which
> > are pointing to the output data sets in some way?  I suppose it's not
> > terribly hard
> > to just do
> >
> >  DistributedRowMatrix outputData =
> >     new DRM(outputPath, myTmpPath, numRows, numCols);
> >
> > after running another job, but maybe it would be even nicer to return a
> > struct-like
> > thing which has all the relevant output data as java objects.
> >
> > Another thing would be making sure that running these classes didn't
> > require
> > such long method argument lists - builders to the rescue!
> >
> >  -jake
> >
> >
> > On Mon, Feb 13, 2012 at 9:31 AM, John Conwell <john@iamjohn.me> wrote:
> >
> > > From my perspective, I'd really like to see the Mahout API migrate away
> > > from a command line centric design it currently utilizes, and migrate
> > more
> > > towards an library centric API design.  I think this would go a long
> way
> > in
> > > getting Mahout adopted into real life commercial applications.
> > >
> > > While there might be a few algorithm drivers that you interact with by
> > > creating an instance of a class and calling some method(s) on the
> > > instance (I haven't actually seen one like that, but there might be a
> > > few), many algorithms are invoked by calling some static function on a
> > > class that takes ~37 typed arguments.  But what's worse, many drivers
> > > are invoked by having to create a String array with ~37 arguments as
> > > string values, and calling the static main function on the class.
> > >
> > > Now I'm not saying that having a static main function available to
> > > invoke an algorithm from the command line isn't useful.  It is, when
> > > you're testing an algorithm.  But once you want to integrate the
> > > algorithm into a commercial workflow, it kind of sucks.
> > >
> > > For example, imagine if the API for invoking Math.max were designed
> > > the way many of the Mahout algorithms currently are.  You'd have
> > > something like this:
> > >
> > > String[] args = new String[3];
> > > args[0] = "max";
> > > args[1] = "7";
> > > args[2] = "4";
> > > int max = Math.main(args);
> > >
> > > It makes your code a horrible mess and very hard to maintain, as well
> > > as very prone to bugs.
> > >
> > > When I see a bunch of static main functions as the only way to interact
> > > with a library, no matter what the quality of the library is, my
> initial
> > > impression is that this has to be some minimally supported effort by a
> > few
> > > PhD candidates still in academia, who will drop the project as soon as
> > they
> > > graduate.  And while this might not be the case, it is one of the first
> > > impressions it gives, and can lead a company to drop the library from
> > > consideration before they do any due diligence into its quality and
> > > utility.
> > >
> > > I think as Mahout matures and gets closer to a 1.0 release, this kind
> of
> > > API re-design will become more and more necessary, especially if you
> > want a
> > > higher Mahout integration rate into commercial applications and
> > workflows.
> > >
> > > Also, I hope I don't sound too negative.  I'm very impressed with
> > > Mahout and its capabilities.  I really like that there is a
> > > well-thought-out class library of primitives for designing new serial
> > > and distributed machine learning algorithms.  And I think it has high
> > > utility for integration into highly visible commercial projects.  But
> > > its high-level public API really is a barrier to entry when trying to
> > > design commercial applications.
> > >
> > >
> > > On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> > > <jdog@windwardsolutions.com>wrote:
> > >
> > > > We have a couple JIRAs that relate here: We want to factor all the
> > (-cl)
> > > > classification steps out of all of the driver classes (MAHOUT-930)
> and
> > > into
> > > > a separate job to remove duplicated code; MAHOUT-931 is to add a
> > > pluggable
> > > > outlier removal capability to this job; and MAHOUT-933 is aimed at
> > > > factoring all the iteration mechanics from each driver class into the
> > > > ClusterIterator, which uses a ClusterClassifier which is itself an
> > > > OnlineLearner. This will hopefully allow semi-supervised classifier
> > > > applications to be constructed by feeding cluster-derived models into
> > the
> > > > classification process. Still kind of fuzzy at this point but
> promising
> > > too.
> > > >
> > > > On 2/11/12 2:29 PM, Frank Scholten wrote:
> > > >
> > > >> ...
> > > >>
> > > >> What kind of clustering refactoring do you mean here? I did some
> > > >> work on creating bean configurations in the past (MAHOUT-612). I
> > > >> underestimated the amount of work required to do the entire
> > > >> refactoring. If this can be contributed and committed on a per-job
> > > >> basis I would like to help out.
> > > >>
> > > >>> ...
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > John C
> > >
> >
>
>
>
> --
>
> Thanks,
> John C
>
