From: Jake Mannix
Date: Mon, 13 Feb 2012 11:11:00 -0800
Subject: Re: Goals for Mahout 0.7
To: user@mahout.apache.org

Hi John,

This is some very good feedback, and warrants serious discussion. In
spite of this, I'm going to respond on the fly with some thoughts in this
vein.

We use Mahout at Twitter (the LDA stuff recently put in, and
mahout-collections in various places, among other things) in production,
and we use it, actually, via command-line invocations of the
$MAHOUT_HOME/bin/mahout shell script. It's invoked in an environment where
we keep all of the parameters passed in in various (revision controlled)
config files, the inputs are produced by a series of Pig jobs which are
invoked in similar ways, and the outputs on HDFS are loaded by various and
sundry processes in their own ways.

So in general, I totally agree with you that having production *java* apps
call into main() methods of other classes is extremely ugly and
error-prone.

So how would it look to interact via a nice java API with a system which
was going to launch some (possibly iterative series of) MapReduce jobs? I
guess I can see how this would go: DistributedLanczosSolver, for example,
can be run without the main() method:

    public int run(Path inputPath,
                   Path outputPath,
                   Path outputTmpPath,
                   Path workingDirPath,
                   int numRows,
                   int numCols,
                   boolean isSymmetric,
                   int desiredRank)

is something you could run right after instantiating a
DistributedLanczosSolver and .setConf()'ing it.

So is that the kind of thing we'd want more of? Or are you thinking of
something nicer, where instead of just a response code, you get handles on
java objects which are pointing to the output data sets in some way?
I suppose it's not terribly hard to just do

    DistributedRowMatrix outputData =
        new DistributedRowMatrix(outputPath, myTmpPath, numRows, numCols);

after running another job, but maybe it would be even nicer to return a
struct-like thing which has all the relevant output data as java objects.

Another thing would be making sure that running these classes didn't
require such long method argument lists - builders to the rescue!

  -jake

On Mon, Feb 13, 2012 at 9:31 AM, John Conwell wrote:

> From my perspective, I'd really like to see the Mahout API migrate away
> from the command-line-centric design it currently uses and toward a
> library-centric API design. I think this would go a long way toward
> getting Mahout adopted into real-life commercial applications.
>
> While there might be a few algorithm drivers that you interact with by
> creating an instance of a class and calling some method(s) on the instance
> (I haven't actually seen one like that, but there might be a few), many
> algorithms are invoked by calling some static function on a class that
> takes ~37 typed arguments. But what's worse, many drivers are invoked by
> having to create a String array with ~37 arguments as string values, and
> calling the static main function on the class.
>
> Now I'm not saying that having a static main function available to invoke
> an algorithm from the command line isn't useful. It is, when you're
> testing an algorithm. But once you want to integrate the algorithm into a
> commercial workflow it kind of sucks.
>
> For example, imagine if the API for invoking Math.max was designed the way
> many of the Mahout algorithms currently are. You'd have something like
> this:
>
>     String[] args = new String[3];
>     args[0] = "max";
>     args[1] = "7";
>     args[2] = "4";
>     int max = Math.main(args);
>
> It makes your code a horrible mess and very hard to maintain, as well as
> very prone to bugs.
>
> When I see a bunch of static main functions as the only way to interact
> with a library, no matter what the quality of the library is, my initial
> impression is that this has to be some minimally supported effort by a few
> PhD candidates still in academia, who will drop the project as soon as they
> graduate. And while this might not be the case, it is one of the first
> impressions it gives, and can lead a company to drop the library from
> consideration before they do any due diligence into its quality and
> utility.
>
> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> API redesign will become more and more necessary, especially if you want a
> higher Mahout integration rate into commercial applications and workflows.
>
> Also, I hope I don't sound too negative. I'm very impressed with Mahout and
> its capabilities. I really like that there is a well-thought-out class
> library of primitives for designing new serial and distributed machine
> learning algorithms. And I think it has a high utility for integration
> into highly visible commercial projects. But its high-level public API
> really is a barrier to entry when trying to design commercial applications.
>
>
> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman wrote:
>
> > We have a couple of JIRAs that relate here: We want to factor all the
> > (-cl) classification steps out of all of the driver classes (MAHOUT-930)
> > and into a separate job to remove duplicated code; MAHOUT-931 is to add a
> > pluggable outlier removal capability to this job; and MAHOUT-933 is aimed
> > at factoring all the iteration mechanics from each driver class into the
> > ClusterIterator, which uses a ClusterClassifier which is itself an
> > OnlineLearner. This will hopefully allow semi-supervised classifier
> > applications to be constructed by feeding cluster-derived models into the
> > classification process. Still kind of fuzzy at this point but promising
> > too.
> >
> > On 2/11/12 2:29 PM, Frank Scholten wrote:
> >
> >> ...
> >>
> >> What kind of clustering refactoring do you mean here? I did some work
> >> on creating bean configurations in the past (MAHOUT-612). I
> >> underestimated the amount of work required to do the entire
> >> refactoring. If this can be contributed and committed on a per-job
> >> basis I would like to help out.
> >>
> >>> ...
> >>>
> >>
> >
>
> --
>
> Thanks,
> John C
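
For readers of the archive, the "builders to the rescue" idea and the
"struct-like thing" for outputs mentioned in the reply could look roughly
like the sketch below. This is only an illustration under stated
assumptions: SolverJobConfig and SolverJobResult are hypothetical names that
do not exist in Mahout, paths are plain Strings rather than Hadoop Path
objects so the example stays self-contained, and no job is actually
launched. The field names simply mirror the run(...) parameters quoted in
the thread.

```java
import java.util.Objects;

// Hypothetical parameter object; none of these names are Mahout API.
// It replaces a long positional argument list with named setters.
final class SolverJobConfig {
    private final String inputPath;
    private final String outputPath;
    private final int numRows;
    private final int numCols;
    private final boolean symmetric;
    private final int desiredRank;

    private SolverJobConfig(Builder b) {
        this.inputPath = b.inputPath;
        this.outputPath = b.outputPath;
        this.numRows = b.numRows;
        this.numCols = b.numCols;
        this.symmetric = b.symmetric;
        this.desiredRank = b.desiredRank;
    }

    static Builder builder() { return new Builder(); }

    String inputPath() { return inputPath; }
    String outputPath() { return outputPath; }
    int numRows() { return numRows; }
    int numCols() { return numCols; }
    boolean isSymmetric() { return symmetric; }
    int desiredRank() { return desiredRank; }

    static final class Builder {
        private String inputPath;
        private String outputPath;
        private int numRows;
        private int numCols;
        private boolean symmetric; // defaults to false
        private int desiredRank;

        Builder inputPath(String p) { this.inputPath = p; return this; }
        Builder outputPath(String p) { this.outputPath = p; return this; }
        Builder numRows(int n) { this.numRows = n; return this; }
        Builder numCols(int n) { this.numCols = n; return this; }
        Builder symmetric(boolean s) { this.symmetric = s; return this; }
        Builder desiredRank(int r) { this.desiredRank = r; return this; }

        // Validates once, at build time, instead of deep inside a job.
        SolverJobConfig build() {
            Objects.requireNonNull(inputPath, "inputPath is required");
            Objects.requireNonNull(outputPath, "outputPath is required");
            if (numRows <= 0 || numCols <= 0 || desiredRank <= 0) {
                throw new IllegalArgumentException(
                    "numRows, numCols and desiredRank must be positive");
            }
            return new SolverJobConfig(this);
        }
    }
}

// Hypothetical struct-like result: the exit code plus a handle on where
// the output lives, instead of a bare int return value.
final class SolverJobResult {
    final int exitCode;
    final String outputPath;

    SolverJobResult(int exitCode, SolverJobConfig config) {
        this.exitCode = exitCode;
        this.outputPath = config.outputPath();
    }

    boolean succeeded() { return exitCode == 0; }
}
```

A caller would then write something like
SolverJobConfig.builder().inputPath("/data/in").outputPath("/data/out")
.numRows(100).numCols(50).desiredRank(10).build(), which keeps each argument
labeled at the call site and catches a missing path before any job starts.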