mahout-dev mailing list archives

From "Gokhan Capan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
Date Sun, 01 Jun 2014 15:04:02 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985 ]

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM:
--------------------------------------------------------------

[~dlyubimov], I imagine that in the near future we will want to add a matrix implementation with
fast row and column access for memory-based algorithms such as neighborhood-based recommendation.
This could be a new persistent store engineered to preserve kNN locality, the new
Solr backend potentially cast to a Matrix, or something else.

Anyway, my point is that we may want to add different types of distributed matrices with
engine-specific (or data-structure-specific) strengths in the future. I suggest turning each behavior
(such as Caching) into an additional trait, which the author of a distributed execution engine (or data
structure) can mix in to her concrete implementation (for example, Spark's matrix is
one with Caching and Broadcasting). It might even make logical planning easier (if a matrix
supports caching, cache it; if two matrices are partitioned the same way, do one thing, otherwise
do another; if one matrix is small, broadcast it; etc.).
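
To illustrate the planning idea, here is a minimal sketch of how a planner could dispatch on
the traits an operand mixes in. Everything in it is hypothetical: the trait names follow the
wording above, while ToyEngineMatrix, PlainMatrix, Planner.prepare and the SmallCells threshold
are made up for the example and are not part of Mahout's API.

```scala
// Hypothetical capability traits; none of these signatures come from Mahout.
trait Matrix { def nrows: Long; def ncols: Long }
trait Caching { def cache(): Unit }
trait Broadcasting { def broadcast(): Unit }

// Toy stand-in for an engine-specific matrix that has both capabilities.
final class ToyEngineMatrix(val nrows: Long, val ncols: Long)
    extends Matrix with Caching with Broadcasting {
  var cached = false
  var broadcasted = false
  def cache(): Unit = cached = true
  def broadcast(): Unit = broadcasted = true
}

// Toy matrix without any optional capability.
final class PlainMatrix(val nrows: Long, val ncols: Long) extends Matrix

object Planner {
  // Arbitrary "small matrix" threshold, chosen only for illustration.
  val SmallCells = 1000000L

  // Picks a preparation step from the traits the operand supports
  // and returns a label describing the decision.
  def prepare(m: Matrix): String = m match {
    case b: Broadcasting if m.nrows * m.ncols <= SmallCells =>
      b.broadcast(); "broadcast"
    case c: Caching =>
      c.cache(); "cache"
    case _ => "no-op"
  }
}
```

The planner never names a concrete engine; it only asks which capabilities are present, so a
new backend would take part in planning simply by mixing in the traits it can honor.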

So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution
trait with methods for partitioning and parallel execution, a Caching trait with
methods for caching and uncaching, and, in the future, a RandomAccess trait with methods for
accessing rows and columns (and possibly cells).

Then a concrete DRM (or DRM-like matrix) would be a Matrix with BatchExecution and possibly Caching, a concrete
RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think? If
you and others are positive, how do you think this should be handled?
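
A rough sketch of the proposed hierarchy, to make the discussion concrete. Only the trait
names and the nrows/ncols methods come from the text above; every other method name and
signature (numPartitions, mapRows, cache/uncache, row/column/cell) and both concrete classes
are assumptions made for illustration, backed by plain in-memory collections rather than any
real engine.

```scala
// Base trait, as described: only the dimensions.
trait Matrix {
  def nrows: Long
  def ncols: Long
}

// Assumed shape of the partitioning / parallel-execution behavior.
trait BatchExecution {
  def numPartitions: Int
  def mapRows(f: Array[Double] => Array[Double]): Matrix with BatchExecution
}

// Assumed shape of the caching behavior.
trait Caching {
  def cache(): this.type
  def uncache(): this.type
}

// Assumed shape of the future random-access behavior.
trait RandomAccess {
  def row(i: Int): Array[Double]
  def column(j: Int): Array[Double]
  def cell(i: Int, j: Int): Double = row(i)(j)
}

// A DRM-like matrix: Matrix with BatchExecution and Caching (toy, single process).
final class LocalDrm(rows: Vector[Array[Double]], val numPartitions: Int = 1)
    extends Matrix with BatchExecution with Caching {
  private var cachedFlag = false
  def nrows: Long = rows.length.toLong
  def ncols: Long = rows.headOption.map(_.length.toLong).getOrElse(0L)
  def mapRows(f: Array[Double] => Array[Double]): LocalDrm =
    new LocalDrm(rows.map(f), numPartitions)
  def cache(): this.type = { cachedFlag = true; this }
  def uncache(): this.type = { cachedFlag = false; this }
  def isCached: Boolean = cachedFlag
}

// A RandomAccessMatrix: Matrix with RandomAccess (toy, dense arrays).
final class InMemoryRandomAccessMatrix(data: Array[Array[Double]])
    extends Matrix with RandomAccess {
  def nrows: Long = data.length.toLong
  def ncols: Long = if (data.isEmpty) 0L else data(0).length.toLong
  def row(i: Int): Array[Double] = data(i).clone()
  def column(j: Int): Array[Double] = data.map(_(j))
}
```

Algorithm code could then state exactly what it needs in a parameter type, for example
`Matrix with RandomAccess` for a neighborhood-based recommender, without referring to a
specific backend.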



> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-1529
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> We have a few situations where the algorithm-facing API has Spark dependencies creeping in.
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
