Date: Fri, 8 Jun 2018 18:17:00 +0000 (UTC)
From: "Li Jin (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13158914.1526164995000.133522.1528481820380@Atlassian.JIRA>
In-Reply-To: <JIRA.13158914.1526164995000@Atlassian.JIRA>
References: <JIRA.13158914.1526164995000@Atlassian.JIRA> <JIRA.13158914.1526164995647@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for
 ML Matrix and Vector types


    [ https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506399#comment-16506399 ]

Li Jin commented on SPARK-24258:
--------------------------------

I ran into [~mengxr] and chatted about this. A good first step seems to be making a tensor type a first-class type in Spark DataFrames. For operations, there are concerns about having to add a large number of linear algebra functions to the Spark codebase, so it's not clear whether that part is a good idea.

Any thoughts?

> SPIP: Improve PySpark support for ML Matrix and Vector types
> ------------------------------------------------------------
>
>                 Key: SPARK-24258
>                 URL: https://issues.apache.org/jira/browse/SPARK-24258
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Leif Walsh
>            Priority: Major
>
> h1. Background and Motivation:
> In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can construct: {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and {{DenseMatrix}}.  In PySpark, you can construct one of these vectors with {{VectorAssembler}}, then run Python UDFs on these columns and use {{toArray()}} to get numpy ndarrays and do things with them.  They also have a small native API where you can compute {{dot()}}, {{norm()}}, and a few other things with them (I think these are computed in Scala, not Python, but I could be wrong).
> For statistical applications, having the ability to manipulate columns of matrix and vector values would be powerful (from here on, I will use the term tensor to refer to arrays of arbitrary dimensionality; matrices are 2-tensors and vectors are 1-tensors).  For example, you could use PySpark to reshape your data in parallel, assemble some matrices from your raw data, and then run some statistical computation on them using UDFs leveraging Python libraries like statsmodels, numpy, tensorflow, and scikit-learn.
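
To make the terminology above concrete, here is a minimal local sketch in plain numpy (no Spark involved). The data values and variable names are made up for illustration; the point is only the shape of the workflow: assemble a vector (1-tensor) and a matrix (2-tensor) from raw rows, then hand them to a numerical library.

```python
import numpy as np

# Hypothetical raw rows: (response, predictor, intercept term).
rows = [
    (1.0, 0.0, 1.0),
    (3.0, 1.0, 1.0),
    (5.0, 2.0, 1.0),
]

# Assemble a 1-tensor (vector) and a 2-tensor (matrix) from the raw rows.
y = np.array([r[0] for r in rows])   # shape (3,)
X = np.array([r[1:] for r in rows])  # shape (3, 2)

# Ordinary least squares with numpy; statsmodels or scikit-learn
# could consume the same arrays.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
```

In the scenario described above, the assembly step would happen in parallel in Spark and the last step would run inside a UDF.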
> I propose enriching the {{pyspark.ml.linalg}} types in the following ways:
> # Expand the set of column operations one can apply to tensor columns beyond the few functions currently available on these types.  Ideally, the API should aim to be as wide as the numpy ndarray API, but would wrap Breeze operations.  For example, we should provide {{DenseVector.outerProduct()}} so that a user could write something like {{df.withColumn("XtX", df["X"].outerProduct(df["X"]))}}.
> # Make sure all ser/de mechanisms (including Arrow) understand these types, and faithfully represent them as natural types in all languages when applying UDFs or collecting with {{toPandas()}}: in Scala and Java, Breeze objects; in Python, numpy ndarrays rather than the pyspark.ml.linalg types that wrap them; in SparkR, I'm not sure what, but something natural.
> # Improve the construction of these types from scalar columns.  The {{VectorAssembler}} API is not very ergonomic.  I propose something like {{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], df["feature3"]))}}.
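
As a rough illustration of the semantics these two proposals might have, the sketch below emulates them locally with numpy. Note that `Vector.of` and `outerProduct` are proposals from this issue, not existing PySpark API; `vector_of` and `outer_product` are hypothetical stand-ins that operate on whole numpy columns rather than on DataFrame columns, and the row-wise interpretation of the outer product is an assumption.

```python
import numpy as np

def vector_of(*columns):
    """Stand-in for the proposed Vector.of: stack scalar columns so that
    each row becomes one feature vector."""
    return np.column_stack(columns)

def outer_product(left, right):
    """Stand-in for the proposed outerProduct: for every row n, compute
    the outer product of left[n] and right[n]."""
    return np.einsum("ni,nj->nij", left, right)

# Three hypothetical scalar columns with two rows each.
feature1 = np.array([1.0, 2.0])
feature2 = np.array([3.0, 4.0])
feature3 = np.array([5.0, 6.0])

predictors = vector_of(feature1, feature2, feature3)  # shape (2, 3)
xtx = outer_product(predictors, predictors)           # shape (2, 3, 3)
```

Each row of `predictors` plays the role of one vector value in a column, and each slice `xtx[n]` is the matrix value that the corresponding row of the "XtX" column would hold.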
> h1. Target Personas:
> Data scientists, machine learning practitioners, machine learning library developers.
> h1. Goals:
> This would allow users to do more statistical computation in Spark natively, and would allow users to apply Python statistical computation to data in Spark using UDFs.
> h1. Non-Goals:
> I suppose one non-goal is to reimplement something like statsmodels using Breeze data structures and computation.  That could be seen as an effort to enrich Spark ML itself, but is out of scope of this effort.  This effort is just to make it possible and easy to apply existing Python libraries to tensor values in parallel.
> h1. Proposed API Changes:
> Add the above APIs to PySpark and the other language bindings.  I think the list is:
> # {{pyspark.ml.linalg.Vector.of(*columns)}}
> # {{pyspark.ml.linalg.Matrix.of(<not sure what goes here, maybe we don't provide this>)}}
> # For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc.  https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html has a good list to look at.
> Also, change Python UDFs so that these tensor types are passed to the Python function not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap {{numpy.ndarray}}, but as {{numpy.ndarray}} objects by themselves, and interpret return values that are {{numpy.ndarray}} objects back into the Spark types.
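
The UDF change described in the last paragraph could behave roughly like the wrapper sketched below. This is only an illustration under stated assumptions: `WrappedVector` is a hypothetical stand-in for a `pyspark.ml.linalg` vector (it only mimics `toArray()`), and `ndarray_udf` is an invented name, not a PySpark decorator.

```python
import functools
import numpy as np

class WrappedVector:
    """Hypothetical stand-in for a vector type that wraps an ndarray."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def toArray(self):
        return self.values

def ndarray_udf(func):
    """Pass wrapped vectors to func as plain ndarrays, and wrap ndarray
    return values back into the vector type."""

    @functools.wraps(func)
    def wrapper(*args):
        # Unwrap anything that exposes toArray(); leave other values alone.
        unwrapped = [a.toArray() if hasattr(a, "toArray") else a for a in args]
        result = func(*unwrapped)
        # Re-wrap ndarray results; scalars and other values pass through.
        if isinstance(result, np.ndarray):
            return WrappedVector(result)
        return result

    return wrapper

@ndarray_udf
def center(x):
    # The user function only ever sees and returns numpy.ndarray objects.
    return x - x.mean()

out = center(WrappedVector([1.0, 2.0, 3.0]))
```

The user-written function stays free of Spark-specific types, which is what would make it easy to reuse existing numpy-based code inside UDFs.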

