From: Pat Ferrel
To: Mahout Dev List
Subject: Re: Next release
Date: Thu, 5 Mar 2015 09:59:27 -0800

Seems like we need the top list to be responded to also.

Agree about similarity but a completely different method is needed for cosine and the other actual distance measures. The way the old Hadoop code did it is more appropriate. I’ll put it on my list.

> On Mar 5, 2015, at 9:46 AM, Andrew Musselman wrote:
>
> Agree with Suneel's comments.
>
> So you're proposing these four things for 0.10, right? I'm good with these.
>
> 1) mrlegacy & scala dependency reduction and possible split
> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
>
> On Thu, Mar 5, 2015 at 9:31 AM, Suneel Marthi wrote:
>
> Agree with most of the points outlined below, next steps would be to work towards 0.10.
>
>> From: Pat Ferrel
>> To: Suneel Marthi; ap.dev; Andrew Musselman
>> Sent: Thursday, March 5, 2015 12:11 PM
>> Subject: Next release
>>
>> I’d send this to @dev if it won’t turn into a public argument. Maybe leave out the wishlist?
>>
>> Hopefully people will chime in with opinions or status but here’s what it looks like to me:
>>
>> 1) The DSL needs the mrlegacy pruning that is ready but held up by external issues. This would be required if we do a project split. Also the external deps have been reduced to nearly the minimum and are written to a smallish jar in the spark module. It is possible to do more fine-grained class-level shading but not sure it’s needed.
>> 2) significant DSL additions are held up by external issues but there is already SSVD, PCA, QR and pretty mature linear algebra ops.
>> 3) similarity, item (column) and row, seem to be fine with LLR only, and therefore are mainly for recommender use cases.
> >>>> It would be nice to generalize this to be able to use any similarity measure before next release.
>
>> 4) Naive Bayes: only a partial pipeline for text classification is implemented in Scala, but NB itself is working; TF-IDF is in progress
>> 5) There is some distributed aggregation work that is waiting in a PR and seems to be stalled. I’d vote to see this included.
>>
> >>> +1
>
>> What is a minimum release?
>>
>> Sort of an odd question without a clear idea of what Mahout is. I see its future as a scalable R-like environment integrated with Scala and distributed computation engines like Spark. Put another way it is a distributed optimized linear algebra environment and library with some important higher level algorithms. It is general where things like MLlib do not attempt to be.
>>
>> When would you use Mahout vs MLlib or H2O? If you need deep learning, look at H2O, if you need KMeans look at MLlib, if you require or want to mix in a general linear algebra engine look at Mahout’s DSL since it plays well with MLlib and to some degree H2O.
>>
>> What is a minimum release given the above definition?
>>
>> Seems like polishing up the 5 things mentioned above along with:
>> 1) mrlegacy & scala dependency reduction and possible split
>> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
>> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
>>
>> Not sure we should go down this rat hole right now so feel free to ignore this but my intermediate term and post release wishlist is:
>>
>> 1) more stats and polish to the shell (savable workspaces, etc)
>> 2) some helpers/conversions to make accessing MLlib easier. For instance a few lines of code would make KMeans usable with DRMs
>> 3) a lightweight package formalization for adding new contributor-based high level algorithms, maybe along the lines of Examples which pull in code from github and include their own build mechanism.
> +1
>> 4) finish the text pipeline
> +1, would explore the new text processing features available in Lucene 5. Please don't go by how MLlib does this
>> 5) integrate Spark dataframes with DRMs and IndexedDatasets
> +1
>> 6) retire sequence files for PMML, JSON (SchemaRDD/Dataframes), CSV, whatever. These are only needed as input and output not intermediate results anymore so why have sequence files when supporting IO to other tools like Hive, Spark SQL, Solr/ES and others is more important?
>>
> +100, sequencefiles have been Mahout's nemesis all along