From: Jake Mannix
Date: Wed, 22 Feb 2012 10:44:59 -0800
Subject: Re: Goals for Mahout 0.7
To: user@mahout.apache.org

On Wed, Feb 22, 2012 at 10:00 AM, John Conwell wrote:

> I've been meaning to respond with my thoughts on this (though it took
> me almost two weeks to get around to it).
>
> Jake, your example of the DistributedLanczosSolver in how to interact
> with the different algorithms is along the lines of what I was
> thinking, at least as a bare minimum. I'm a huge fan of using Builder
> classes for these types of scenarios, but I do understand that they
> are a pain to write, so in the short term, making all the algorithms
> API-friendly by just having run functions with typed arguments is
> fine. Anything to get rid of the String[] args variables I'm creating
> and passing around.
>
> You also mention the output of the algorithm APIs. I'm not a big fan
> of the returned 1 or 0 response codes. Seeing that sends me into COM
> HRESULT PTSD-induced panic attacks (NOTE: I'm not making light of PTSD).
> Except it's worse than HRESULTs, because at least there were multiple
> HRESULT values that I could theoretically look up to figure out the
> actual problem that occurred.
>
> If I had my way, I would want the API output to return me two things:
> handles/objects that point to all the generated output of the
> algorithm (like you mentioned), and an object that gives me all the
> information I need to track the Hadoop MapReduce jobs that were
> invoked by the API call.
>
> The first one is a nice-to-have. Since I most likely pass in a Path
> object for where I want the output to go, I know where the output is,
> and I should be able to infer what type of data it is, and so forth.
> Having output handles to this data would be really nice, and would
> make integrating Mahout into larger workflows much easier, but it's
> not a show stopper.
>
> But the second one is VERY important and can be a show stopper. Any
> large workflow that uses Hadoop somewhere in its API stack needs two
> things. First, any call to Hadoop needs to expose to the caller some
> kind of handle / identifier for the Hadoop job that was launched. This
> is because the caller should be able to monitor the Hadoop job,
> provide status and feedback to users, troubleshoot, and so on, as with
> any kind of long-running process. And if the Mahout API call invokes
> multiple Hadoop jobs in a row, as is often the case in Mahout, the
> caller needs to be able to gain access to each of the Hadoop job ids
> as they become available. Second, any blocking, long-running API call
> needs to expose the option to run the call asynchronously (and provide
> Hadoop job ids as the Hadoop jobs get invoked).
>
> Take, for example, the LDA algorithm. It's not unreasonable to say
> that calling LDADriver.run() could start a chain of N MapReduce jobs
> that could take 8 hours to complete, given a large enough corpus of
> documents and a large enough number of iterations.
> In trying to integrate this into a workflow application, I have to
> design my app knowing that every time it calls LDADriver.run() it
> could potentially block the process for anywhere from several hours to
> several days, with no way to inspect the progress of what is
> happening. The core problems are: my app has no idea how long it's
> going to block, how far along the blocked process is, whether any of
> the MapReduce jobs failed, and, if they did fail, which MapReduce jobs
> are associated with which call to LDADriver.run().
>
> But if all algorithm API calls allowed me to invoke them
> asynchronously, and provided me with an object that I could use to
> track what is going on in Hadoop, such as a list of job ids updated in
> real time (an eventing mechanism for when new job ids are added would
> be nice, but is not a must), it would go a long way toward lowering
> the barrier to entry for integrating Mahout into commercial
> applications.

+1

I like this idea: synchronously return a handle to a MahoutStatus
object, which you can poll for current status, current paths to output
stuff, even handles to intermediate state (and eventually final state).
That would be awesome. It's totally pro-style, unlike what we have now.

> One last thing: I'd like to see Mahout getting away from using static
> functions so much. I don't really have a non-religious reason for
> this, other than to say that I find when people use APIs that are very
> static-function heavy, they tend to write their own code in the same
> way, and you end up with 1000-line monolithic functions being invoked
> from main() functions, which is never a good thing.

Agreed, big-time. Static functions actually *are* the devil, for the
most part. I actually do subscribe to that religion, but I haven't been
to church in a long time. Mea culpa?

> Is that too much to ask? :)

Not at all.
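
To make the status-handle idea concrete, here is a minimal sketch in
plain Java with no Hadoop dependency. Everything in it is hypothetical:
Mahout has no MahoutStatus class today, and FakeIterativeDriver,
runAsync, recordJob and the made-up job ids exist only for illustration.
A real driver would submit MapReduce jobs where the sketch records fake
ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical handle returned immediately by an asynchronous driver call. */
final class MahoutStatus {
    private final List<String> jobIds = new CopyOnWriteArrayList<>();
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();
    private final CompletableFuture<String> outputPath = new CompletableFuture<>();

    /** Snapshot of the job ids recorded so far; safe to poll from any thread. */
    List<String> jobIds() {
        return Collections.unmodifiableList(new ArrayList<>(jobIds));
    }

    /** Optional eventing: invoked for each job id recorded after registration. */
    void onJobStarted(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** True once the whole chain has either finished or failed. */
    boolean isDone() {
        return outputPath.isDone();
    }

    /** Blocks until the chain finishes; throws CompletionException on failure. */
    String awaitOutputPath() {
        return outputPath.join();
    }

    // The three methods below are meant for the driver, not for API users.
    void recordJob(String jobId) {
        jobIds.add(jobId);
        listeners.forEach(l -> l.accept(jobId));
    }

    void complete(String path) {
        outputPath.complete(path);
    }

    void fail(Throwable cause) {
        outputPath.completeExceptionally(cause);
    }
}

/** Stand-in for an iterative driver (assumption: one job per iteration). */
final class FakeIterativeDriver {
    MahoutStatus runAsync(String outputPath, int iterations) {
        MahoutStatus status = new MahoutStatus();
        CompletableFuture.runAsync(() -> {
            try {
                for (int i = 0; i < iterations; i++) {
                    // A real driver would submit a job here and record its real id.
                    status.recordJob("job_fake_" + i);
                }
                status.complete(outputPath);
            } catch (RuntimeException e) {
                status.fail(e);
            }
        });
        return status; // returned right away; the caller is never blocked
    }
}
```

A caller would keep the returned handle, poll jobIds() and isDone()
from a monitoring thread, and call awaitOutputPath() only where
blocking is acceptable. Listeners registered after the call returns may
miss early job ids, so polling jobIds() remains the reliable path.
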
-jake

> On Mon, Feb 13, 2012 at 11:11 AM, Jake Mannix wrote:
>
> > Hi John,
> >
> > This is some very good feedback, and warrants serious discussion. In
> > spite of this, I'm going to respond on the fly with some thoughts in
> > this vein.
> >
> > We use Mahout at Twitter (the LDA stuff recently put in, and
> > mahout-collections in various places, among other things) in
> > production, and we actually use it via command-line invocations of
> > the $MAHOUT_HOME/bin/mahout shell script. It's invoked in an
> > environment where all of the parameters we pass in are kept in
> > various (revision-controlled) config files, the inputs are produced
> > by a series of Pig jobs which are invoked in similar ways, and the
> > outputs on HDFS are loaded by various and sundry processes in their
> > own ways.
> >
> > So in general, I totally agree with you that having production *java*
> > apps call into main() methods of other classes is extremely ugly and
> > error-prone. So how would it look to interact via a nice java API
> > with a system which was going to launch some (possibly iterative
> > series of) MapReduce jobs?
> >
> > I guess I can see how this would go: DistributedLanczosSolver, for
> > example, can be run without the main() method:
> >
> >   public int run(Path inputPath,
> >                  Path outputPath,
> >                  Path outputTmpPath,
> >                  Path workingDirPath,
> >                  int numRows,
> >                  int numCols,
> >                  boolean isSymmetric,
> >                  int desiredRank)
> >
> > is something you could run right after instantiating a
> > DistributedLanczosSolver and .setConf()'ing it.
> >
> > So is that the kind of thing we'd want more of? Or are you thinking
> > of something nicer, where instead of just a response code, you get
> > handles on java objects which point to the output data sets in some
> > way?
> > I suppose it's not terribly hard to just do
> >
> >   DistributedRowMatrix outputData =
> >       new DistributedRowMatrix(outputPath, myTmpPath, numRows, numCols);
> >
> > after running another job, but maybe it would be even nicer to return
> > a struct-like thing which has all the relevant output data as java
> > objects.
> >
> > Another thing would be making sure that running these classes didn't
> > require such long method argument lists - builders to the rescue!
> >
> > -jake
> >
> > On Mon, Feb 13, 2012 at 9:31 AM, John Conwell wrote:
> >
> > > From my perspective, I'd really like to see the Mahout API migrate
> > > away from the command-line-centric design it currently uses and
> > > toward a library-centric API design. I think this would go a long
> > > way in getting Mahout adopted into real-life commercial
> > > applications.
> > >
> > > While there might be a few algorithm drivers that you interact with
> > > by creating an instance of a class and calling some method(s) on
> > > the instance (I haven't actually seen one like that, but there
> > > might be a few), many algorithms are invoked by calling some static
> > > function on a class that takes ~37 typed arguments. But what's
> > > worse, many drivers are invoked by having to create a String array
> > > with ~37 arguments as string values, and calling the static main
> > > function on the class.
> > >
> > > Now I'm not saying that having a static main function available to
> > > invoke an algorithm from the command line isn't useful. It is, when
> > > you're testing an algorithm. But once you want to integrate the
> > > algorithm into a commercial workflow it kind of sucks.
> > >
> > > For example, imagine if the API for invoking Math.max was designed
> > > the way many of the Mahout algorithms currently are?
> > > You'd have something like this:
> > >
> > >   String[] args = new String[3];
> > >   args[0] = "max";
> > >   args[1] = "7";
> > >   args[2] = "4";
> > >   int max = Math.main(args);
> > >
> > > It makes your code a horrible mess and very hard to maintain, as
> > > well as very prone to bugs.
> > >
> > > When I see a bunch of static main functions as the only way to
> > > interact with a library, no matter what the quality of the library
> > > is, my initial impression is that this has to be some minimally
> > > supported effort by a few PhD candidates still in academia, who
> > > will drop the project as soon as they graduate. And while this
> > > might not be the case, it is one of the first impressions it gives,
> > > and can lead a company to drop the library from consideration
> > > before they do any due diligence into its quality and utility.
> > >
> > > I think as Mahout matures and gets closer to a 1.0 release, this
> > > kind of API re-design will become more and more necessary,
> > > especially if you want a higher Mahout integration rate into
> > > commercial applications and workflows.
> > >
> > > Also, I hope I don't sound too negative. I'm very impressed with
> > > Mahout and its capabilities. I really like that there is a
> > > well-thought-out class library of primitives for designing new
> > > serial and distributed machine learning algorithms. And I think it
> > > has a high utility for integration into highly visible commercial
> > > projects. But its high-level public API really is a barrier to
> > > entry when trying to design commercial applications.
> > >
> > > On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman wrote:
> > >
> > > > We have a couple of JIRAs that relate here: We want to factor all
> > > > the (-cl) classification steps out of all of the driver classes
> > > > (MAHOUT-930) and into a separate job to remove duplicated code;
> > > > MAHOUT-931 is to add a pluggable outlier removal capability to
> > > > this job; and MAHOUT-933 is aimed at factoring all the iteration
> > > > mechanics from each driver class into the ClusterIterator, which
> > > > uses a ClusterClassifier which is itself an OnlineLearner. This
> > > > will hopefully allow semi-supervised classifier applications to
> > > > be constructed by feeding cluster-derived models into the
> > > > classification process. Still kind of fuzzy at this point but
> > > > promising too.
> > > >
> > > > On 2/11/12 2:29 PM, Frank Scholten wrote:
> > > >
> > > >> ...
> > > >>
> > > >> What kind of clustering refactoring do you mean here? I did some
> > > >> work on creating bean configurations in the past (MAHOUT-612). I
> > > >> underestimated the amount of work required to do the entire
> > > >> refactoring. If this can be contributed and committed on a
> > > >> per-job basis I would like to help out.
> > > >>
> > > >>> ...
> > >
> > > --
> > > Thanks,
> > > John C
>
> --
> Thanks,
> John C
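
The builder idea raised in the thread, as an alternative to long typed
argument lists, could look roughly like the sketch below. It is an
illustration under stated assumptions, not Mahout code: LanczosJobConfig
and its Builder are hypothetical names, paths are plain Strings so the
example has no Hadoop dependency, and only a subset of the parameters
from the run(...) signature quoted above is modeled.

```java
/** Hypothetical immutable parameter object for one Lanczos run. */
final class LanczosJobConfig {
    final String inputPath;
    final String outputPath;
    final int numRows;
    final int numCols;
    final boolean symmetric;
    final int desiredRank;

    private LanczosJobConfig(Builder b) {
        this.inputPath = b.inputPath;
        this.outputPath = b.outputPath;
        this.numRows = b.numRows;
        this.numCols = b.numCols;
        this.symmetric = b.symmetric;
        this.desiredRank = b.desiredRank;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Collects named parameters, then validates them once in build(). */
    static final class Builder {
        private String inputPath;
        private String outputPath;
        private int numRows;
        private int numCols;
        private boolean symmetric;
        private int desiredRank;

        Builder inputPath(String path) { this.inputPath = path; return this; }
        Builder outputPath(String path) { this.outputPath = path; return this; }
        Builder numRows(int rows) { this.numRows = rows; return this; }
        Builder numCols(int cols) { this.numCols = cols; return this; }
        Builder symmetric(boolean value) { this.symmetric = value; return this; }
        Builder desiredRank(int rank) { this.desiredRank = rank; return this; }

        LanczosJobConfig build() {
            // Fail early with a readable message instead of a 0/1 return code.
            if (inputPath == null || outputPath == null) {
                throw new IllegalStateException("inputPath and outputPath are required");
            }
            if (numRows <= 0 || numCols <= 0 || desiredRank <= 0) {
                throw new IllegalStateException(
                    "numRows, numCols and desiredRank must be positive");
            }
            return new LanczosJobConfig(this);
        }
    }
}
```

Each call site then names every parameter it sets, for example
LanczosJobConfig.builder().inputPath("in").outputPath("out")
.numRows(100).numCols(50).desiredRank(10).build(), and a missing or
invalid value is reported before any job would be launched.
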