spark-issues mailing list archives

From "Leif Walsh (JIRA)"
Subject [jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for ML Matrix and Vector types
Date Tue, 12 Jun 2018 23:54:00 GMT


Leif Walsh commented on SPARK-24258:

I think for PySpark users, we could just make it easy to write Python UDFs that call numpy and
pandas code directly. That’s probably the model those users want anyway, so they can be
sure the results are numerically consistent with the pandas and numpy code they already have.

I realize that this leaves the Scala and Java API users out in the cold, but it’s a possible
place to start. Really, I think the “add many many linear algebra functions” problem is
only a problem for the JVM APIs, where we’d need to link them up to Breeze and test that they
do the same thing as the numpy ones.
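
As a rough sketch of the "use numpy code directly" idea, a small decorator could unwrap vector and matrix arguments before calling the user's function. Everything named here ({{unwrap_tensors}}, {{wrap_result}}, {{FakeVector}}, {{total}}) is hypothetical and made up for illustration; the only thing assumed about the Spark types is that they expose {{toArray()}}, as mentioned in the issue description. A stand-in class is used so the sketch runs without Spark or numpy installed.

```python
import functools


def unwrap_tensors(fn, wrap_result=None):
    """Call fn with plain arrays instead of Spark vector/matrix wrappers.

    Any positional argument that has a toArray() method is replaced by the
    result of that call. wrap_result is an optional, hypothetical hook that
    converts fn's return value back into a Spark type.
    """
    @functools.wraps(fn)
    def wrapper(*args):
        unwrapped = [a.toArray() if hasattr(a, "toArray") else a for a in args]
        result = fn(*unwrapped)
        return wrap_result(result) if wrap_result is not None else result
    return wrapper


class FakeVector:
    """Stand-in for a Spark vector so the sketch runs without Spark."""

    def __init__(self, values):
        self._values = list(values)

    def toArray(self):
        # The real toArray() returns a numpy.ndarray; a list is enough here.
        return self._values


def total(xs):
    # Stands in for existing numpy/pandas code that expects a plain array.
    return sum(xs)


wrapped_total = unwrap_tensors(total)
print(wrapped_total(FakeVector([1.0, 2.0, 3.0])))
```

With Spark available, the wrapped function could presumably be passed to {{pyspark.sql.functions.udf}} like any other Python function, with {{wrap_result}} set to something like {{pyspark.ml.linalg.Vectors.dense}} when the function returns an array. That part is untested here.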

> SPIP: Improve PySpark support for ML Matrix and Vector types
> ------------------------------------------------------------
>                 Key: SPARK-24258
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Leif Walsh
>            Priority: Major
> h1. Background and Motivation:
> In Spark ML ({{}}), there are four column types you can construct: {{SparseVector}},
{{DenseVector}}, {{SparseMatrix}}, and {{DenseMatrix}}.  In PySpark, you can construct one
of these vectors with {{VectorAssembler}}, and then you can run Python UDFs on these columns,
and use {{toArray()}} to get numpy ndarrays and do things with them.  They also have a small
native API where you can compute {{dot()}}, {{norm()}}, and a few other things with them (I
think these are computed in Scala, not Python, but I could be wrong).
> For statistical applications, having the ability to manipulate columns of matrix and
vector values would be powerful.  (From here on, I will use the term tensor to refer to arrays
of arbitrary dimensionality; matrices are 2-tensors and vectors are 1-tensors.)  For example, you could
use PySpark to reshape your data in parallel, assemble some matrices from your raw data, and
then run some statistical computation on them using UDFs leveraging Python libraries like
statsmodels, numpy, tensorflow, and scikit-learn.
> I propose enriching the {{}} types in the following ways:
> # Expand the set of column operations one can apply to tensor columns beyond the few
functions currently available on these types.  Ideally, the API should aim to be as wide as
the numpy ndarray API, but would wrap Breeze operations.  For example, we should provide {{DenseVector.outerProduct()}}
so that a user could write something like {{df.withColumn("XtX", df["X"].outerProduct(df["X"]))}}.
> # Make sure all ser/de mechanisms (including Arrow) understand these types, and faithfully
represent them as natural types in all languages (Breeze objects in Scala and Java; numpy
ndarrays, rather than the types that wrap them, in Python; something natural in SparkR, though
I'm not sure what) when applying UDFs or collecting with {{toPandas()}}.
> # Improve the construction of these types from scalar columns.  The {{VectorAssembler}}
API is not very ergonomic.  I propose something like {{df.withColumn("predictors", Vector.of(df["feature1"],
df["feature2"], df["feature3"]))}}.
> h1. Target Personas:
> Data scientists, machine learning practitioners, machine learning library developers.
> h1. Goals:
> This would allow users to do more statistical computation in Spark natively, and would
allow users to apply Python statistical computation to data in Spark using UDFs.
> h1. Non-Goals:
> I suppose one non-goal is to reimplement something like statsmodels using Breeze data
structures and computation.  That could be seen as an effort to enrich Spark ML itself, but
it is out of scope for this effort.  This effort is just to make it possible and easy to apply
existing Python libraries to tensor values in parallel.
> h1. Proposed API Changes:
> Add the above APIs to PySpark and the other language bindings.  I think the list is:
> # {{*columns)}}
> # {{<not sure what goes here, maybe we don't provide this>)}}
> # For each of the matrix and vector types in {{}}, add more methods
like {{outerProduct}}, {{matmul}}, {{kron}}, etc.
has a good list to look at.
> Also, change Python UDFs so that these tensor types are passed to the Python function
not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap {{numpy.ndarray}}, but as {{numpy.ndarray}}
objects by themselves, and convert return values that are {{numpy.ndarray}} objects back
into the Spark types.

This message was sent by Atlassian JIRA
