Subject: Re: The new improved command-line: MahoutDriver (get it?)
From: Grant Ingersoll
Date: Fri, 19 Mar 2010 08:34:00 -0400
To: mahout-dev@lucene.apache.org
In-Reply-To: <4b124c311003021210t7488198nc7fb6b419f7498e3@mail.gmail.com>

RE: MahoutDriver (get it?)  Perhaps we should have called it MahoutMahout?

At any rate, very good stuff.

On Mar 2, 2010, at 3:10 PM, Jake Mannix wrote:

> Hey all,
>
> Just an update on the new-and-improved command-line "UI" we have now.
> After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301
> has been committed, and brings with it the easy ability to trim down your
> long long command lines for most of our *Driver main() methods, by saving
> your default command-line arguments for various drivers in properties files
> (which are then overridable via the command line), either locally or on
> hadoop.
> Feature-set is as follows (usage after that):
>
> Either from the binary distribution or from source (after having done "mvn
> install", naturally), this is the setup - there are a bunch of properties
> files with a kludgey format (because I didn't want to dig into the xml
> rathole, and while a nice flexible schema is nice, I opted to follow the
> YAGNI principle) :
>
>  *) there is a new directory "conf" at the top level (of the binary dist,
> as well as source), which contains a bunch of *.props files: one special one
> called driver.classes.props, which has the mapping between (the keys)
> fully-qualified class name of a class which has a main() method, and the
> "short-name" (the values) and brief description.  The current file is just
> the following:
>
> ###
> org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
> org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
> org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
> org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
> org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
> org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirchlet Allocation
> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
> org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
> org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
> org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
> org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
> org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
> org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files
> org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
> org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test Bayes Classifier
> org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train Bayes Classifier
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
> org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
> ###
>
> It's meant to be read into java.util.Properties, where the values on the
> right hand side are further split by ":" into the short-name (to be used on
> the command-line) and the description (printed to stdout if an invalid input
> is made or "-h" is used with no class name to run).  *If there are missing
> classes from this list, please add them!*
>
>  *) there are also a bunch of files in conf/ which are named
> <short-name>.props, where <short-name> is one of the short-names from
> driver.classes.props above.  These files are mostly empty now (well,
> commented out), but for example, conf/svd.props is currently:
>
> #i|input =
> #o|output =
> #nr|numRows =
> #nc|numCols =
> #r|rank =
> #t|tempDir =
>
> the format of these props files is that the key is of the form
> "singleDashCmdLineOpt|doubleDashCmdLineOpt" (if there is no "|" in the key,
> the short and long form will be assumed to be the same), and the value is
> whatever you would want that option to be (options with no value are not
> currently supported; this is a TODO).
>
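
For anyone who wants to poke at the driver.classes.props format outside of Mahout, here is a minimal sketch of how such a file could be read. It is not the MahoutDriver code; the class name DriverClassesSketch and the method parse are made up for illustration, and splitting on only the first ":" is an assumption of mine. It relies on standard java.util.Properties behaviour: lines starting with "#" (including the "###" markers) are comments, and the key ends at the first whitespace or "=".

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration only; not taken from the Mahout source.
public class DriverClassesSketch {

    /** Returns short-name -> {fully-qualified class name, description}. */
    static Map<String, String[]> parse(String propsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String[]> byShortName = new LinkedHashMap<>();
        for (String className : props.stringPropertyNames()) {
            // Assumption: split on the first ':' only, so a description
            // may itself contain colons.
            String[] parts = props.getProperty(className).split(":", 2);
            String shortName = parts[0].trim();
            String description = parts.length > 1 ? parts[1].trim() : "";
            byShortName.put(shortName, new String[] {className, description});
        }
        return byShortName;
    }

    public static void main(String[] args) {
        String sample =
              "###\n"
            + "org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering\n"
            + "org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files\n"
            + "###\n";
        // Iteration order follows Properties and is not guaranteed.
        for (Map.Entry<String, String[]> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()[0]
                + " (" + e.getValue()[1] + ")");
        }
    }
}
```

Keying the result by short-name makes the command-line lookup ("kmeans" to KMeansDriver) a single map access, and the descriptions are at hand for a "-h" style listing.
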
> So for example, if you had a command line such as:
>
>> $MAHOUT_HOME/bin/mahout svd --input /path/to/input -o /path/to/output -nr <numRows> --numCols <numCols> -r <rank> -t /tmp/svd
>
> You could just uncomment the lines in conf/svd.props as
>
> i|input = /path/to/input
> o|output = /path/to/output
> nr|numRows = <numRows>
> nc|numCols = <numCols>
> r|rank = <rank>
> t|tempDir = /tmp/svd
>
> and run as
>
>> $MAHOUT_HOME/bin/mahout svd
>
> If you wanted to run a second time, but you didn't want to overwrite your
> old results, you could then do
>
>> $MAHOUT_HOME/bin/mahout svd -o /path/to/newOutput
>
> which would override /path/to/output and instead use /path/to/newOutput,
> with all the other properties coming from svd.props.
>
>  *) the $MAHOUT_HOME/conf directory is just a template - the mahout shell
> script adds $MAHOUT_CONF_DIR to the classpath (or $MAHOUT_HOME/conf if
> $MAHOUT_CONF_DIR is not defined), and MahoutDriver reads the properties
> files from the classpath.
>
>  *) running on Hadoop: if your $HADOOP_HOME and $HADOOP_CONF_DIR are set,
> the mahout shell script automatically launches your requested main method on
> your hadoop cluster, otherwise it's run locally.
>
>  *) if your main() isn't defined in driver.classes.props, that's ok,
> it'll still run via:
>
> $MAHOUT_HOME/bin/mahout org.apache.mahout.blah.blah.SomeOtherDriver [remaining args]
>
> and in fact, if you put "org.apache.mahout.blah.blah.SomeOtherDriver.props"
> on your classpath, and it has the format of the <short-name>.props files
> described above, it will be used for default properties for this class.
>
> --------
>
> I'll put this up in some nicer form for the wiki in the next couple of days.
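
The defaults-plus-override behaviour described above can be sketched roughly as follows. Again, this is only an illustration under stated assumptions, not what MahoutDriver actually does: the names PropsDefaultsSketch and mergeArgs are hypothetical, defaults are always emitted in their long ("--") form, and any argument starting with "-" is treated as an option name (so a negative number as a value would confuse it).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the Mahout source.
public class PropsDefaultsSketch {

    /** Appends defaults from a "short|long = value" props text to the CLI args. */
    static List<String> mergeArgs(String propsText, String[] cliArgs) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Options given explicitly on the command line, e.g. "-o" or "--output".
        Set<String> given = new HashSet<>();
        for (String arg : cliArgs) {
            if (arg.startsWith("-")) {
                given.add(arg);
            }
        }

        List<String> merged = new ArrayList<>(Arrays.asList(cliArgs));
        // Sorted keys give a stable order; Properties itself is unordered.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            String[] names = key.split("\\|", 2);
            String shortOpt = "-" + names[0].trim();
            // No "|" in the key: short and long form are the same name.
            String longOpt = "--" + (names.length > 1 ? names[1] : names[0]).trim();
            if (given.contains(shortOpt) || given.contains(longOpt)) {
                continue; // the command line wins over the props default
            }
            merged.add(longOpt);
            merged.add(props.getProperty(key).trim());
        }
        return merged;
    }

    public static void main(String[] args) {
        String svdProps =
              "i|input = /path/to/input\n"
            + "o|output = /path/to/output\n"
            + "#r|rank =\n"   // commented-out lines are skipped by Properties
            + "t|tempDir = /tmp/svd\n";
        // Prints: [-o, /path/to/newOutput, --input, /path/to/input, --tempDir, /tmp/svd]
        System.out.println(mergeArgs(svdProps, new String[] {"-o", "/path/to/newOutput"}));
    }
}
```

Checking both the short and the long spelling before adding a default is what lets "-o /path/to/newOutput" replace the "o|output" entry while the other entries still come from the props file.
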
>
> Try out various driver classes that you use - we all use different ones, so
> getting some dev/user manual test coverage would be nice, because it's kinda
> tricky to unit test shell scripts and command line args and env variables
> (and running on a real cluster, etc...).  We should try to fix any bugs
> before release.
>
> Feedback welcome.  It's hacky, but it adds some useful functionality, and we
> can clean up the props-file syntax (or ditch it for xml/yaml/json/whatever)
> as needed later.
>
>  -jake

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search