mahout-user mailing list archives

From John Conwell <j...@iamjohn.me>
Subject Re: Goals for Mahout 0.7
Date Mon, 13 Feb 2012 17:31:59 GMT
From my perspective, I'd really like to see the Mahout API migrate away
from the command-line-centric design it currently uses and towards a
library-centric API design.  I think this would go a long way towards
getting Mahout adopted into real-life commercial applications.

While there might be a few algorithm drivers that you use by creating an
instance of a class and calling methods on that instance (I haven't
actually seen one like that, but there might be a few), many algorithms are
invoked by calling a static function on a class that takes ~37 typed
arguments.  But what's worse, many drivers are invoked by creating a String
array with ~37 arguments as string values and calling the static main
function on the class.

Now I'm not saying that having a static main function available to invoke
an algorithm from the command line isn't useful.  It is, when you're testing
an algorithm.  But once you want to integrate the algorithm into a
commercial workflow it kind of sucks.

For example, imagine if the API for invoking Math.max were designed the way
many of the Mahout algorithms currently are.  You'd have something like
this:

// Hypothetical: java.lang.Math has no main method; this only shows the style.
String[] args = new String[3];
args[0] = "max";
args[1] = "7";
args[2] = "4";
int max = Math.main(args);

It makes your code a horrible mess and very hard to maintain, as well as
very prone to bugs.
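
As a rough sketch of the library-centric alternative, the following keeps a static main function for command-line use but makes it a thin wrapper around a typed entry point. Every name here (ClusterJobSketch, Options, run, and the four parameters) is hypothetical and made up for illustration; none of it is part of the Mahout API.

```java
// Hypothetical sketch: none of these names exist in Mahout.
final class ClusterJobSketch {

    /** Typed, validated options instead of a String[] of flags. */
    static final class Options {
        private String input;
        private String output;
        private int k = 10;
        private int maxIterations = 20;

        Options input(String path) { this.input = path; return this; }
        Options output(String path) { this.output = path; return this; }

        Options k(int k) {
            if (k <= 0) throw new IllegalArgumentException("k must be positive");
            this.k = k;
            return this;
        }

        Options maxIterations(int n) {
            if (n <= 0) throw new IllegalArgumentException("maxIterations must be positive");
            this.maxIterations = n;
            return this;
        }
    }

    /** Library entry point: a plain method call with typed arguments and a return value. */
    static String run(Options o) {
        if (o.input == null || o.output == null) {
            throw new IllegalStateException("input and output are required");
        }
        // A real driver would launch the job here; this sketch only describes it.
        return "cluster " + o.input + " -> " + o.output
                + " (k=" + o.k + ", maxIterations=" + o.maxIterations + ")";
    }

    /** Command-line entry point: parses the strings once, then delegates to run(). */
    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("usage: <input> <output> <k> <maxIterations>");
            return;
        }
        Options o = new Options()
                .input(args[0])
                .output(args[1])
                .k(Integer.parseInt(args[2]))
                .maxIterations(Integer.parseInt(args[3]));
        System.out.println(run(o));
    }
}
```

With a shape like this, application code calls run(new Options().input(...).output(...).k(3)) directly, so the compiler checks argument types and a mistyped parameter fails at build time rather than at run time, while the command line keeps working for testing.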

When I see a bunch of static main functions as the only way to interact
with a library, no matter what the quality of the library is, my initial
impression is that this has to be some minimally supported effort by a few
PhD candidates still in academia, who will drop the project as soon as they
graduate.  And while this might not be the case, it is one of the first
impressions it gives, and can lead a company to drop the library from
consideration before they do any due diligence into its quality and utility.

I think as Mahout matures and gets closer to a 1.0 release, this kind of
API re-design will become more and more necessary, especially if you want a
higher Mahout integration rate into commercial applications and workflows.

Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
its capabilities.  I really like that there is a well-thought-out class
library of primitives for designing new serial and distributed machine
learning algorithms.  And I think it has high utility for integration
into highly visible commercial projects.  But its high-level public API
really is a barrier to entry when trying to design commercial applications.


On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
<jdog@windwardsolutions.com> wrote:

> We have a couple JIRAs that relate here: We want to factor all the (-cl)
> classification steps out of all of the driver classes (MAHOUT-930) and into
> a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable
> outlier removal capability to this job; and MAHOUT-933 is aimed at
> factoring all the iteration mechanics from each driver class into the
> ClusterIterator, which uses a ClusterClassifier which is itself an
> OnlineLearner. This will hopefully allow semi-supervised classifier
> applications to be constructed by feeding cluster-derived models into the
> classification process. Still kind of fuzzy at this point but promising too.
>
> On 2/11/12 2:29 PM, Frank Scholten wrote:
>
>> ...
>>
>> What kind of clustering refactoring do you mean here? I did some work on
>> creating bean configurations in the past (MAHOUT-612). I underestimated the
>> amount of work required to do the entire refactoring. If this can be
>> contributed and committed on a per-job basis I would like to help out.
>>
>>> ...
>>>
>>
>>
>


-- 

Thanks,
John C
