mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Call for vote on integrating h2o
Date Mon, 14 Jul 2014 16:43:00 GMT
On Mon, Jul 14, 2014 at 9:36 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> 1) every change to the DSL should be implemented either in core-math or in
> _both_ engines, right? So every committer will have to be willing to take
> this on when changing the DSL, right? We don’t want divergence in DSL
> implementation.
>

Well, I think that every committer should sign up to help form a bit of a
consortium of engine-oriented committers to handle such changes.  There will
be some specialization before long.
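
For concreteness, a rough sketch of that split, with invented names rather
than Mahout's real interfaces: the DSL operator stays engine-neutral, and
each engine module supplies its own physical implementation of the same
trait.

// Invented names; a sketch of "one logical operator, one physical
// implementation per engine", not Mahout's actual API.
trait DrmHandle        // opaque stand-in for an engine-resident matrix

trait EngineBackend {
  /** Physical A'A of a distributed row matrix; one body per engine. */
  def ata(drm: DrmHandle): DrmHandle
}

object SparkBackend extends EngineBackend {
  def ata(drm: DrmHandle): DrmHandle = {
    // The Spark-specific physical plan (block ops, shuffles) would go here.
    drm
  }
}

object H2OBackend extends EngineBackend {
  def ata(drm: DrmHandle): DrmHandle = {
    // The H2O-specific physical plan would go here.
    drm
  }
}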


> 2) are we going to allow the build to be broken for extended periods
> (hopefully only a day or two) until one or the other expert gets time to
> help with a DSL implementation?
>

No.  I think that the original committer should insert a stub
implementation that throws an exception and file a JIRA.  The unit test for
the capability may have to be limited temporarily, but the build should not
break.  The engine-doesn't-do-this JIRA should be a release stopper.
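
A minimal sketch of such a stub, with a placeholder class name and a
placeholder JIRA id rather than real Mahout artifacts:

// Stub for a DSL capability the h2o engine does not implement yet.
// "MAHOUT-XXXX" stands in for the real tracking issue.
object H2OOpAtB {
  def atb(a: AnyRef, b: AnyRef): Nothing =
    throw new UnsupportedOperationException(
      "A'B is not implemented on the h2o engine yet; see MAHOUT-XXXX")
}

The matching unit test can then be skipped with a log line pointing at the
same JIRA instead of failing, which keeps the build green while leaving the
gap visible until the release-blocking issue is resolved.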



> This is for cases where #1 is not possible. This will happen with both
> tests and abstract defs in core-math that are carried through other
> engine-specific classes. The way to get things to compile may not be
> immediately obvious, so to keep things going, a profile or target for each
> engine might help.
>

Profile is an interesting idea.


> 3) This will create an instant split in what algos are implemented on h2o
> and spark. We should clearly mark these and ideally minimize them.
>

Agree.
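
One possible way to mark that split clearly, sketched here with made-up
algorithm names rather than any existing Mahout facility, is a small
capability table that both the documentation and the test harness can read:

// Illustrative only; entries and names are invented for the sketch.
object EngineCoverage {
  val supported: Map[String, Set[String]] = Map(
    "naive-bayes"  -> Set("spark", "h2o"),
    "cooccurrence" -> Set("spark"),
    "ssvd"         -> Set("spark", "h2o")
  )

  /** True when `algo` has an implementation on `engine`. */
  def runsOn(algo: String, engine: String): Boolean =
    supported.getOrElse(algo, Set.empty[String]).contains(engine)
}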


>  4) Users are going to be confused. Do they need to install Spark or not,
> what runs on what, and what are the differences? The ideal is to say it all
> runs on both, so all users have to do is choose their engine. But that may
> never happen. How do we handle this? There is confusion coming over Hadoop
> MapReduce vs. Spark, and we don’t want to add to this.
>

Fair point.  Just like the confusion among XFS, EXT3, EXT4, and ZFS.  Needs
documentation.


>  5) Can we agree on file-level formats and/or other ways to pass a
> parallelized DRM from one engine to the other? This will allow us to create
> hybrid pipelines, potentially easing user confusion.
>

I want to avoid file-level data communication as much as possible.

Will it be possible to make the file handling generic?  I can see how it
might be possible and how it might not be.  Can we push the file handling
back on the user?  Can we only support a few persistence technologies (say,
local file, HDFS, and URL)?
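
A rough sketch of what supporting only a few persistence technologies could
look like, with invented names rather than Mahout's actual I/O API: the URI
scheme picks the backend, and anything outside the supported set is pushed
back on the user.

import java.net.URI

// Hypothetical persistence seam; "AnyRef" stands in for a real DRM type.
trait DrmPersistence {
  def write(drm: AnyRef, dest: URI): Unit
  def read(src: URI): AnyRef
}

object DrmPersistence {
  /** Map a URI scheme onto the small supported set. */
  def backendFor(uri: URI): String = Option(uri.getScheme).getOrElse("file") match {
    case "file"           => "local-file"
    case "hdfs"           => "hdfs"
    case "http" | "https" => "url"
    case other            =>
      throw new IllegalArgumentException("unsupported persistence scheme: " + other)
  }
}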
