mahout-dev mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Next release
Date Thu, 05 Mar 2015 17:59:27 GMT
Seems like we need the top list to be responded to also.

Agree about similarity but a completely different method is needed for cosine and the other
actual distance measures. The way the old Hadoop code did it is more appropriate. I’ll put
it on my list.
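
For illustration only, a minimal Python sketch of why the two need different methods. It is not taken from the Mahout code base and all function names are hypothetical. LLR can be computed from four cooccurrence counts alone (the entropy form of Dunning's log-likelihood ratio is used here), whereas cosine needs the actual vector values and their norms.

```python
import math

def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # N times the Shannon entropy of the counts, where N = sum(counts)
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: both items seen together, k12/k21: only one of the two seen,
    # k22: neither seen. Only counts are needed, no vector values.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))

def cosine(a, b):
    # Needs the actual values of both vectors plus their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The LLR inputs fall out of a cooccurrence count plus row and column totals, while cosine needs per-vector norms and the weighted dot product, so it does not fit the same counting path.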


> On Mar 5, 2015, at 9:46 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> 
> Agree with Suneel's comments.
> 
> So you're proposing these four things for 0.10, right?  I'm good with these.
> 
> 1) mrlegacy & scala dependency reduction and possible split
> 2) sync with most widely used Spark version (implies frequent releases to stay synced
with big distros I suspect)
> 3) the release build is completely broken. No artifacts are created for scala, spark,
or h2o. No hosted scaladocs are created afaik.
> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like
what Mahout is today.
> 
> 
> On Thu, Mar 5, 2015 at 9:31 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
> 
> Agree with most of the points outlined below, next steps would be to work towards 0.10.

> 
>> From: Pat Ferrel <pat@occamsmachete.com>
>> To: Suneel Marthi <suneel_marthi@yahoo.com>; ap.dev <ap.dev@outlook.com>; Andrew Musselman <andrew.musselman@gmail.com>
>> Sent: Thursday, March 5, 2015 12:11 PM
>> Subject: Next release
>> 
>> I’d send this to @dev if it won’t turn into a public argument. Maybe leave out
the wishlist?
>> 
>> Hopefully people will chime in with opinions or status but here’s what it looks
like to me:
>> 
>> 1) The DSL needs the mrlegacy pruning that is ready but held up by external issues.
This would be required if we do a project split. Also the external deps have been reduced
to nearly the minimum and are written to a smallish jar in the spark module. It is possible
to do more fine grained class-level shading but not sure it’s needed.
>> 2) significant DSL additions are held up by external issues but there is already
SSVD, PCA, QR and pretty mature linear algebra ops.
>> 3) similarity, item (column) and row seem to be fine with LLR only, and therefore
are mainly for recommender use cases.
> >>>> It would be nice to generalize this to be able to use any similarity
measure before next release.
> 
>> 4) Naive Bayes: only a partial pipeline for text classification is implemented in Scala,
but NB itself is working; TF-IDF is in progress
>> 5) There is some distributed aggregation work that is waiting in a PR and seems to
be stalled. I’d vote to see this included.
>> 
> >>> +1
> 
>> What is a minimum release?
>> 
>> Sort of an odd question without a clear idea of what Mahout is. I see its future
as a scalable R-like environment integrated with Scala and distributed computation engines
like Spark. Put another way it is a distributed optimized linear algebra environment and library
with some important higher level algorithms. It is general where things like MLlib do not
attempt to be.
>> 
>> When would you use Mahout vs MLlib or H2O? If you need deep learning, look at H2O;
if you need KMeans, look at MLlib; if you require or want to mix in a general linear algebra
engine, look at Mahout’s DSL, since it plays well with MLlib and to some degree H2O.
>> 
>> What is a minimum release given the above definition?
>> 
>> Seems like polishing up the 5 things mentioned above along with:
>> 1) mrlegacy & scala dependency reduction and possible split
>> 2) sync with most widely used Spark version (implies frequent releases to stay synced
with big distros I suspect)
>> 3) the release build is completely broken. No artifacts are created for scala, spark,
or h2o. No hosted scaladocs are created afaik.
>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything
like what Mahout is today.
>> 
>> Not sure we should go down this rat hole right now so feel free to ignore this but
my intermediate term and post release wishlist is:
>> 
>> 1) more stats and polish to the shell (savable workspaces, etc)
>> 2) some helpers/conversions to make accessing MLlib easier. For instance a few lines
of code would make KMeans usable with DRMs 
>> 3) a lightweight package formalization for adding new contributor based high level
algorithms—maybe along the lines of Examples which pull in code from github and include
their own build mechanism.
> +1
>> 4) finish the text pipeline
> +1, would explore the new text processing features available in Lucene 5. Please don't
go by how MLlib does this
>> 5) integrate Spark dataframes with DRMs and IndexedDatasets
> +1
>> 6) retire sequence files for PMML, JSON (SchemaRDD/Dataframes), CSV—whatever. These
are only needed as input and output not intermediate results anymore so why have sequence
files when supporting IO to other tools like Hive, Spark SQL, Solr/ES and others is more important?
>> 
> +100, sequencefiles have been Mahout's nemesis all along
> 
> 
> 
> 
> 
