mahout-dev mailing list archives

From Pat Ferrel <>
Subject Re: JIRA- legacy & scala labels
Date Fri, 06 Mar 2015 22:56:14 GMT
The simplest way to split the project is into engines: Hadoop and Spark. What is happening
with H2O? Is it being used? Flink isn’t anywhere near ready for a release.

Again, the simplest approach would be two packaged builds: one for the legacy stuff, which
would not require Scala or Spark at all.

The other would be a Maven-based build of the Scala + Spark + Java math modules. This would
be mostly Scala, with only the math module overlapping the legacy side. It requires the
refactoring work that Dmitriy has done, which would make it stand-alone. An sbt build is
clearly optional here but would be in keeping with our all-in Scala approach. Personally I
like sbt a lot better than Maven, but it is less mature.
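A split like that could be sketched as an sbt multi-module build. This is only a hypothetical sketch: the module names, versions, and directory layout below are illustrative, not the actual Mahout structure; the only real constraint it encodes is "math module shared, everything else Scala/Spark":

```scala
// Hypothetical build.sbt sketch for a standalone Scala + Spark build.
// Module names, paths, and versions are illustrative only.

lazy val commonSettings = Seq(
  organization := "org.apache.mahout",
  scalaVersion := "2.10.4"
)

// The shared math module -- the only overlap with the legacy build.
lazy val math = (project in file("math"))
  .settings(commonSettings: _*)

// The Scala/Spark side, depending only on math.
lazy val sparkBindings = (project in file("spark"))
  .dependsOn(math)
  .settings(commonSettings: _*)
  .settings(
    libraryDependencies +=
      "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
  )
```

With a layout like this, each side can be built and released on its own schedule, and the legacy build never needs to see the Scala tool chain at all.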

The benefits would be:
1) potentially separate release schedules: Hadoop not so often and eventually not at all,
Spark every few days if you follow their schedule (not suggesting this)
2) much faster build times for either branch; as anyone knows, building with tests is
starting to take a long time.
3) possible use of a new tool chain like sbt in the Scala branch
4) a much simpler launcher script. Mahout’s is becoming a mess and doesn’t run at all on
Windows. Requiring it to support both engines is not making things easy, and much work goes
into working around old ideas like the classpath and job.jars. Creating one launcher per
engine would reduce complexity.
5) easier support. If we really are going to have 4 engines, the current build and launch
mechanisms, along with the release schedules, can’t really be maintained; even 2 is ugly.
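Point 4 could amount to something very small per engine. A minimal sketch, assuming a hypothetical installation layout (the directory names and MAHOUT_HOME default below are made up for illustration, not the real script):

```shell
#!/bin/sh
# Hypothetical per-engine launcher sketch; directory names are
# illustrative, not the real Mahout layout.
ENGINE="${1:-spark}"                       # engine chosen on the command line
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"  # assumed install root

# Each engine assembles only its own small classpath plus the shared math jars.
case "$ENGINE" in
  spark)  CLASSPATH="$MAHOUT_HOME/spark/*:$MAHOUT_HOME/math/*" ;;
  hadoop) CLASSPATH="$MAHOUT_HOME/hadoop/*:$MAHOUT_HOME/math/*" ;;
  *)      echo "unknown engine: $ENGINE" >&2; exit 1 ;;
esac

echo "$CLASSPATH"
```

No job.jar tricks, no cross-engine branching: the complexity the current script accumulates from supporting both engines in one file simply goes away.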

On Mar 6, 2015, at 11:52 AM, Suneel Marthi <> wrote:

On Fri, Mar 6, 2015 at 1:41 PM, Andrew Palumbo <> wrote:

> On 03/06/2015 12:44 PM, Pat Ferrel wrote:
>> This is great.
>> So we’ve talked about a name change and shortly we’ll be forced to come
>> up with something that describes what Mahout has become. Most past users
>> think of it as a scalable ML library on Hadoop. That may describe
>> Mahout-Legacy but it seems like we need a name for the Scala
>> DSL/Spark/other? part of the project. Lots of projects have sub-projects so
>> we know there is no issue with naming sub-projects. So my question to
>> everyone is:
>> Should (or can) the Top Level Project be renamed? If so to what?
> I don't like the idea of a top level name change.  I think that it would
> be a much better idea to direct our resources at polishing and developing
> what we have now.  As well, especially for this release, I think that it
> would do a disservice to the "legacy" components (which as you point out
> have not been deprecated) with ~45 completed bugfixes and several more in
> the pipe.
> I don't like the idea of renaming Mahout either and agree with AP.

>> If we don’t rename the TLP then what should we call legacy (not very
>> appealing) and scala/DSL (not a name really)?
> agreed.  Legacy is not the most appealing name.  Maybe something like
> Mahout-MapReduce?  Though that could cause some confusion regarding the "no
> new MapReduce code" policy.
> My opinion:
>> Since we are deemphasizing legacy I’m not sure there is a need to call
>> attention to it by giving it a subproject name. However it is not
>> deprecated so we need to include it in releases and even fix the minimum of
>> critical bugs for some time to come.
> agreed regarding fixing critical legacy bugs.  Looking through the issues
> last night, there didn't seem to be many critical bugs, and a good number
> of issues can probably be closed out as won't fix / not an issue.


>> Mahout is getting beat up in the circles of those who talk about such
>> things, and much of this is because people don’t understand what it has
>> become. Therefore I’d like to see a project rename to reset expectations.
>> Leave the name Mahout for the legacy stuff and give a new name to the Scala
>> environment. Split the builds and create new docs for the Scala stuff. This
>> would seem to make it easier to document: since legacy is most of what the
>> CMS documents, we could create a whole new template for the new project name.
> What is the upside to splitting the builds? I'm not against it- I'm just
> not sure I understand.
>> Failing this, many of the same benefits could be gained by creating
>> legacy and scala sub-projects with better names. This I know we can do, and
>> recall that things like MLlib are generally not tied to Spark when people
>> speak about them. So a subproject could very much have its own identity.
>> Looking at the long history of Mahout it seems like the current
>> generality was hard gained through implementing many special purpose
>> algorithms, some of which were grad student projects. This is where MLlib
>> is today in some ways. So a general framework and environment makes a lot
>> of sense as the evolution of Mahout. Let’s give it a name, something better
>> than DSL.
> I think that a pretty clear description of what the other side of the
> project is has been emerging recently.  IMO we need to start getting it out
> there.  A good start would probably be to update the front page of the
> mahout site.


> I don't have any good ideas regarding names for this side of the project.
>> On Mar 5, 2015, at 7:43 PM, Andrew Musselman <>
>> wrote:
>> Thanks AP
>> On Thursday, March 5, 2015, Andrew Palumbo <> wrote:
>>> I went through all of the unresolved JIRA issues and marked each with at
>>> least a "legacy" or a "scala" (for lack of a better name for all that is
>>> not legacy) label. Hopefully I got them all.
>>> Some are labelled with both (math, build, documentation related to both
>>> or neither, etc.)
>>> legacy issues:
>>> project = MAHOUT AND resolution = Unresolved AND labels = legacy
>>> ORDER BY priority DESC
>>> "scala" issues:
>>> project = MAHOUT AND resolution = Unresolved AND labels = scala
>>> ORDER BY priority DESC
>>> Hopefully this will help us get started closing out some old issues. I'll
>>> try to make another pass over them tomorrow and find some more that can
>>> be closed out.
