spark-issues mailing list archives

From "Matthias Boehm (JIRA)" <>
Subject [jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in package
Date Sun, 12 Jun 2016 00:16:20 GMT


Matthias Boehm commented on SPARK-15882:

I really like this direction and think it has the potential to become a higher level API for
Spark ML, as data frames and data sets have become for Spark SQL.

If there is interest, we'd like to help contribute to this feature by porting over a subset
of distributed linear algebra operations from SystemML.

General Goals: From my perspective, we should aim for an API that hides the underlying data
representation (e.g., RDD/Dataset, sparse/dense, blocking configurations, block/row/coordinate,
partitioning, etc.). Furthermore, it would be great to make it easy to swap out the local
matrix library in use. This approach would allow people to plug in their custom operations (e.g.,
native BLAS libraries/kernels or compressed block operations), while still relying on a common
API and scheme for distributing blocks.
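
To make this goal concrete, here is a minimal Python sketch of such a facade. Every name in it (LocalKernel, PurePythonKernel, DistributedMatrix) is hypothetical and not an existing Spark or SystemML API, and a plain dict stands in for a distributed collection of keyed blocks.

```python
from typing import Dict, List, Protocol, Tuple

Block = List[List[float]]      # one dense local block, row-major
BlockKey = Tuple[int, int]     # (block row index, block column index)


class LocalKernel(Protocol):
    """Hypothetical plug-in point for the local matrix library."""

    def add(self, a: Block, b: Block) -> Block: ...


class PurePythonKernel:
    """Reference kernel; a native BLAS-backed kernel could replace it."""

    def add(self, a: Block, b: Block) -> Block:
        return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class DistributedMatrix:
    """Facade that keeps blocking and storage details out of the public API."""

    def __init__(self, blocks: Dict[BlockKey, Block], kernel: LocalKernel):
        self._blocks = blocks  # stand-in for a distributed set of keyed blocks
        self._kernel = kernel

    def add(self, other: "DistributedMatrix") -> "DistributedMatrix":
        # Assumes both operands share the same blocking. A missing block is
        # treated as all-zero, so the other operand's block is reused as-is.
        out: Dict[BlockKey, Block] = {}
        for key in self._blocks.keys() | other._blocks.keys():
            a = self._blocks.get(key)
            b = other._blocks.get(key)
            if a is None:
                out[key] = b
            elif b is None:
                out[key] = a
            else:
                out[key] = self._kernel.add(a, b)
        return DistributedMatrix(out, self._kernel)

    def to_blocks(self) -> Dict[BlockKey, Block]:
        return dict(self._blocks)
```

Callers only see `add`; swapping the kernel object changes how each block is processed without touching the distribution scheme.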

RDDs over Datasets: For the internal implementation, I would favor RDDs over Datasets because
(1) RDDs allow for more flexibility (e.g., reduceByKey, combineByKey, partitioning-preserving
operations), and (2) encoders don't offer much benefit for blocked representations as the
per-block overhead is typically negligible. 
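
As an illustration of why key-based aggregation matters here, the following sketch emulates a reduceByKey-style blocked matrix multiplication in plain Python. The helper `reduce_by_key` is only a local stand-in for the RDD operation of that name, and the function names and dict-based layout are assumptions made for this example.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, combine):
    """Local stand-in for reduceByKey: merge all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(combine, values) for key, values in groups.items()}


def local_multiply(a, b):
    """Dense product of two local blocks given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def local_add(a, b):
    """Cell-wise sum of two local blocks of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def block_multiply(left, right):
    """left: {(i, k): block}, right: {(k, j): block} -> {(i, j): block}."""
    # Pair up blocks that share the inner index k and emit partial products
    # keyed by the output block position (i, j).
    partials = [
        ((i, j), local_multiply(a, b))
        for (i, k), a in left.items()
        for (k2, j), b in right.items()
        if k == k2
    ]
    # Summing the partial products per output key is the aggregation step.
    return reduce_by_key(partials, local_add)
```

For example, `block_multiply({(0, 0): [[1, 2]], (0, 1): [[3]]}, {(0, 0): [[1], [1]], (1, 0): [[2]]})` yields a single output block at key `(0, 0)` holding the sum of the two partial products.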

Basic Operations: Initially, I would start with a small, well-defined set of operations including
matrix multiplications, unary and binary operations (e.g., arithmetic/comparison), unary aggregates
(e.g., sum/rowSums/colSums, min/max/mean/sd), reorg operations (transpose/diag/reshape/order),
and cumulative aggregates (e.g., cumsum).
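
Cumulative aggregates are the least obvious of these on blocked data, because each block depends on everything above it. One possible approach, sketched below in plain Python under the assumption of a matrix split into row blocks with equal column counts, is to compute per-block column sums first, turn them into offsets, and then run a local cumsum per block. The function names are hypothetical.

```python
def col_sums(block):
    """Column sums of one local block given as a list of rows."""
    return [sum(col) for col in zip(*block)]


def local_cumsum(block, offset):
    """Column-wise running sum of one block, starting from `offset`."""
    out = []
    running = list(offset)
    for row in block:
        running = [r + x for r, x in zip(running, row)]
        out.append(running)
    return out


def blocked_cumsum(row_blocks):
    """row_blocks: blocks stacked top to bottom, all with the same column count."""
    if not row_blocks:
        return []
    ncols = len(row_blocks[0][0])
    totals = [col_sums(block) for block in row_blocks]
    # offsets[i] holds the column sums of all blocks above block i.
    offsets = [[0.0] * ncols]
    for total in totals[:-1]:
        offsets.append([o + t for o, t in zip(offsets[-1], total)])
    # Each block can now be processed independently of the others.
    return [local_cumsum(b, off) for b, off in zip(row_blocks, offsets)]
```

Only the small per-block totals need to be combined across blocks; the per-block pass afterwards is embarrassingly parallel.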

Towards Optimization: Internally, we could implement alternative operations but hide them
under a common interface. For example, matrix multiplication would be exposed as 'multiply'
(consistent with local linalg) - internally, however, we would select between alternative
operations (see,
based on a simple rule set or user-provided hints as done in Spark SQL. Later, we could think
about a more sophisticated optimizer, potentially relying on the existing Catalyst infrastructure.
What do you think? 
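
To illustrate the rule-based selection described above, here is a minimal Python sketch. The strategy names ("broadcast", "shuffle"), the size threshold, and the MatrixStats fields are all assumptions made for this example; they are not the actual rules of Spark SQL or SystemML.

```python
from dataclasses import dataclass
from typing import Optional

STRATEGIES = ("broadcast", "shuffle")      # hypothetical physical operators
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024   # hypothetical threshold


@dataclass(frozen=True)
class MatrixStats:
    rows: int
    cols: int
    nnz: int  # number of non-zero cells

    def estimated_bytes(self) -> int:
        # Rough guess: 8 bytes per stored double, ignoring index overhead.
        return 8 * self.nnz


def choose_multiply(left: MatrixStats, right: MatrixStats,
                    hint: Optional[str] = None) -> str:
    """Pick a physical operator for the logical 'multiply'."""
    if left.cols != right.rows:
        raise ValueError("inner dimensions do not match")
    # Rule 1: an explicit user hint always wins.
    if hint is not None:
        if hint not in STRATEGIES:
            raise ValueError(f"unknown hint: {hint}")
        return hint
    # Rule 2: if one side is small enough, ship it to every task.
    smaller = min(left.estimated_bytes(), right.estimated_bytes())
    if smaller <= BROADCAST_LIMIT_BYTES:
        return "broadcast"
    # Rule 3: otherwise fall back to a shuffle-based multiplication.
    return "shuffle"
```

The caller still only invokes one logical operation; the rule set stays internal and could later be replaced by a cost-based optimizer without changing the public interface.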

> Discuss distributed linear algebra in package
> ------------------------------------------------------
>                 Key: SPARK-15882
>                 URL:
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Joseph K. Bradley
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* should be
> migrated to
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this move?

This message was sent by Atlassian JIRA
