mahout-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: The new improved command-line: MahoutDriver (get it?)
Date Fri, 19 Mar 2010 12:34:00 GMT
RE: MahoutDriver (get it?)

Perhaps we should have called it MahoutMahout?

At any rate, very good stuff.

On Mar 2, 2010, at 3:10 PM, Jake Mannix wrote:

> Hey all,
> 
>  Just an update on the new-and-improved command-line "UI" we have now.
> After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301
> has been committed, and brings with it the easy ability to trim down your
> long long command lines for most of our *Driver main() methods, by saving
> your default command-line arguments for various drivers in properties files
> (which are then overridable via the command line), either locally or on
> hadoop.  Feature-set is as follows (usage after that):
> 
>  Either from the binary distribution or from source (after having done "mvn
> install", naturally), this is the setup - there are a bunch of properties
> files with a kludgey format (because I didn't want to dig into the xml
> rathole, and while a nice flexible schema is nice, I opted to follow the
> YAGNI principle) :
> 
>  *) there is a new directory "conf" at the top level (of the binary dist,
> as well as source), which contains a bunch of *.props files: one special one
> called driver.classes.props, which has the mapping between (the keys)
> fully-qualified class name of a class which has a main() method, and the
> "short-name" (the values) and brief description.  The current file is just
> the following:
> 
> ###
> org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
> org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
> org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
> org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
> org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
> org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirichlet Allocation
> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
> org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
> org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
> org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
> org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
> org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
> org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse : Sparse Vector generation from Text sequence files
> org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
> org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test Bayes Classifier
> org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train Bayes Classifier
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
> org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
> ###
> 
> It's meant to be read into java.util.Properties, where the values on the
> right hand side are further split by ":" into the short-name (to be used on
> the command-line) and the description (printed to stdout if an invalid input
> is made or "-h" is used with no class name to run).  *If there are missing
> classes from this list, please add them!*
> 
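A minimal sketch of the parsing described above, written in Python for illustration only (MahoutDriver itself is Java and reads the file via java.util.Properties). The function name parse_driver_classes and the two-entry sample are assumptions, not part of Mahout; the sketch only shows the "class = short-name : description" split.

```python
def parse_driver_classes(text):
    """Map short name -> (class name, description) from driver.classes.props-style text."""
    drivers = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines such as "###", and anything without "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        class_name, _, value = line.partition("=")
        # The right-hand side is split once on ":" into short name and description.
        short_name, _, description = value.partition(":")
        drivers[short_name.strip()] = (class_name.strip(), description.strip())
    return drivers


# Hypothetical two-entry sample taken from the listing above.
SAMPLE = """\
###
org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
###
"""

drivers = parse_driver_classes(SAMPLE)
for short_name, (class_name, description) in sorted(drivers.items()):
    print(short_name, "->", class_name, "(" + description + ")")
```

Keying the result by short name makes the command-line lookup ("mahout svd") a single dictionary access; whether MahoutDriver stores it this way is not stated in the message.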
>  *) there are also a bunch of files in conf/ which are named
> <shortName>.props, where <shortName> is one of the short names listed in
> driver.classes.props above.  These files are mostly empty now (well,
> commented out), but for example, conf/svd.props is currently:
> 
> #i|input =
> #o|output =
> #nr|numRows =
> #nc|numCols =
> #r|rank =
> #t|tempDir =
> 
> The format of these props files is that the key is of the form
> "singleDashCmdLineOpt|doubleDashCmdLineOpt" (if there is no "|" in the key,
> the short and long forms are assumed to be the same), and the value is
> whatever you would want that option to be (options with no value are not
> currently supported; this is a TODO).  So for example, if you had a
> command line such as:
> 
>> $MAHOUT_HOME/bin/mahout svd --input /path/to/input -o /path/to/output -nr
> <numRows> --numCols <numCols> -r <rank> -t /tmp/svd
> 
> You could just uncomment the lines in conf/svd.props as
> 
> i|input = /path/to/input
> o|output = /path/to/output
> nr|numRows = <numRows>
> nc|numCols = <numCols>
> r|rank = <rank>
> t|tempDir = /tmp/svd
> 
> and run as
> 
>> $MAHOUT_HOME/bin/mahout svd
> 
> If you wanted to run a second time, but you didn't want to overwrite your
> old results, you could then do
> 
>> $MAHOUT_HOME/bin/mahout svd -o /path/to/newOutput
> 
> which would override /path/to/output and instead use /path/to/newOutput,
> with all the other properties coming from the svd.props.
> 
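The override behaviour above can be sketched as follows. This is a Python illustration under stated assumptions, not MahoutDriver's actual code: the helper names parse_props and merge_args, the sample values, and the choice to append defaults after the user's arguments are all hypothetical. It only demonstrates that a props-file default is dropped whenever the same option appears on the command line in either its short or long form.

```python
def parse_props(text):
    """Return (short, long, value) tuples from 'short|long = value' lines."""
    options = []
    for raw in text.splitlines():
        line = raw.strip()
        # Commented-out entries (leading "#") contribute no default.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        short, sep, long_name = key.strip().partition("|")
        if not sep:
            long_name = short  # no "|": short and long forms are the same
        options.append((short, long_name, value.strip()))
    return options


def merge_args(props_text, cli_args):
    """Add props-file defaults for every option not given on the command line."""
    given = {arg for arg in cli_args if arg.startswith("-")}
    merged = list(cli_args)
    for short, long_name, value in parse_props(props_text):
        if "-" + short in given or "--" + long_name in given:
            continue  # the command line wins over the props file
        merged += ["--" + long_name, value]
    return merged


# Hypothetical svd.props content; the rank value is made up for the example.
SVD_PROPS = """\
i|input = /path/to/input
o|output = /path/to/output
r|rank = 10
#t|tempDir = /tmp/svd
"""

print(merge_args(SVD_PROPS, ["-o", "/path/to/newOutput"]))
```

Here the "-o" on the command line suppresses the o|output default, while input and rank still come from the props text and the commented-out tempDir is ignored.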
>  *) the $MAHOUT_HOME/conf directory is just a template - the mahout shell
> script adds $MAHOUT_CONF_DIR to the classpath (or $MAHOUT_HOME/conf if
> $MAHOUT_CONF_DIR is not defined), and MahoutDriver reads the properties
> files from the classpath.
> 
>  *) running on Hadoop:  if your $HADOOP_HOME and $HADOOP_CONF_DIR are set,
> the mahout shell script automatically launches your requested main method on
> your Hadoop cluster; otherwise it is run locally.
> 
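The two environment-variable decisions above (which conf directory goes on the classpath, and whether to run on Hadoop or locally) can be sketched like this. The real logic lives in the bin/mahout shell script; this Python version, its function names, and the example paths are assumptions for illustration, and treating an empty variable the same as an undefined one is also an assumption.

```python
import posixpath


def resolve_conf_dir(env):
    """Return $MAHOUT_CONF_DIR if set, else $MAHOUT_HOME/conf."""
    conf_dir = env.get("MAHOUT_CONF_DIR")
    if conf_dir:
        return conf_dir
    return posixpath.join(env["MAHOUT_HOME"], "conf")


def runs_on_hadoop(env):
    """True only when both $HADOOP_HOME and $HADOOP_CONF_DIR are set."""
    return bool(env.get("HADOOP_HOME")) and bool(env.get("HADOOP_CONF_DIR"))


# Hypothetical environments; the paths are made up for the example.
local_env = {"MAHOUT_HOME": "/opt/mahout"}
cluster_env = {
    "MAHOUT_HOME": "/opt/mahout",
    "MAHOUT_CONF_DIR": "/etc/mahout",
    "HADOOP_HOME": "/opt/hadoop",
    "HADOOP_CONF_DIR": "/etc/hadoop",
}

for env in (local_env, cluster_env):
    mode = "hadoop" if runs_on_hadoop(env) else "local"
    print(resolve_conf_dir(env), mode)
```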
>  *) if your main() isn't listed in driver.classes.props, that's ok,
> it'll still run via:
> 
> 
> $MAHOUT_HOME/bin/mahout org.apache.mahout.blah.blah.SomeOtherDriver [remaining
> args]
> 
> and in fact, if you put "org.apache.mahout.blah.blah.SomeOtherDriver.props"
> on your classpath, and it has the format of the <shortName>.props files
> described above, it will be used for default properties for this class.
> 
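A small sketch of the name resolution described in the last item: a known short name maps to its class and to <shortName>.props, while anything else is treated as a fully-qualified class name whose defaults come from <className>.props. The function resolve_driver and the one-entry mapping are hypothetical, written in Python for illustration rather than taken from MahoutDriver.

```python
def resolve_driver(name, short_names):
    """Return (class name to run, props file to look for on the classpath)."""
    if name in short_names:
        return short_names[name], name + ".props"
    # Not a known short name: assume it is already a fully-qualified class name.
    return name, name + ".props"


# Hypothetical mapping with a single entry from driver.classes.props.
SHORT_NAMES = {"svd": "org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver"}

print(resolve_driver("svd", SHORT_NAMES))
print(resolve_driver("org.apache.mahout.blah.blah.SomeOtherDriver", SHORT_NAMES))
```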
> --------
> 
> I'll put this up in some nicer form for the wiki in the next couple of days.
> 
> 
> Try out various driver classes that you use - we all use different ones, so
> getting some dev/user manual test coverage would be nice, because it's kinda
> tricky to unit test shell scripts and command line args and env variables
> (and running on a real cluster, etc...).  We should try to fix any bugs
> before release.
> 
> Feedback welcome.  It's hacky, but it adds some useful functionality, and we
> can clean up the props-file syntax (or ditch it for xml/yaml/json/whatever)
> as needed later.
> 
>  -jake

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

