Date: Sat, 2 Jun 2018 03:08:00 +0000 (UTC)
From: "Hossein Falaki (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

    [ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498842#comment-16498842 ]

Hossein Falaki commented on SPARK-24359:
----------------------------------------

Yes, my bad: I meant releasing an update to CRAN for every 2.x and 3.x release. However, if Spark does patch releases like 2.3.4, we are not required to push a new CRAN package, though each one is an opportunity to do so. I guess that is identical to the SparkR CRAN release cycle.

> SPIP: ML Pipelines in R
> -----------------------
>
>                 Key: SPARK-24359
>                 URL: https://issues.apache.org/jira/browse/SPARK-24359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.0.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose through SparkR's simple formula-based API.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR's APIs to expose SparkML's pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions, many of which shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR would degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-generation approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr's API is manually written.
> h1. Target Personas
> * Existing SparkR users who need a more flexible SparkML API
> * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R
> h1. Goals
> * R users can install SparkML from CRAN
> * R users will be able to import SparkML independently of SparkR
> * After setting up a Spark session, R users can
> ** create a pipeline by chaining individual components and specifying their parameters
> ** tune a pipeline in parallel, taking advantage of Spark
> ** inspect a pipeline's parameters and evaluation metrics
> ** repeatedly apply a pipeline
> * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers
> h1. Non-Goals
> * Adding new algorithms to the SparkML R package that do not exist in Scala
> * Parallelizing existing CRAN packages
> * Changing the existing SparkR ML wrapper API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will decide based on the following list of priorities: the API choice that addresses a higher-priority goal will be chosen.
> # *Comprehensive coverage of the MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out.
> # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
> # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
> # *Interoperability with the rest of Spark's components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala.
> # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments.
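> For instance, as a hypothetical illustration of the named-argument convention (these constructor parameters are assumed for this sketch and are not part of the proposal's examples):
> {code:java}
> > lr <- spark_logistic_regression(max_iter = 10, reg_param = 0.1){code}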
> Setters can also be applied by nesting calls:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates nicely into a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:
> {code:java}
> > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> lr{code}
> h2. Namespace
> All new APIs will live in a new CRAN package named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}.
> h2. Pipeline & PipelineStage
> A pipeline object contains one or more stages.
> {code:java}
> > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
> Here {{stage1}}, {{stage2}}, etc. are S4 objects of type PipelineStage, and {{pipeline}} is an object of type Pipeline.
> h2. Transformers
> A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.
> *Example API:*
> {code:java}
> > tokenizer <- spark_tokenizer() %>%
>     set_input_col("text") %>%
>     set_output_col("words")
> > tokenized.df <- tokenizer %>% transform(df){code}
> h2. Estimators
> An Estimator is an algorithm which can be fit on a SparkDataFrame to produce a Transformer. For example, a learning algorithm is an Estimator which trains on a SparkDataFrame and produces a model.
> *Example API:*
> {code:java}
> > lr <- spark_logistic_regression() %>%
>     set_max_iter(10) %>%
>     set_reg_param(0.001){code}
> h2. Evaluators
> An Evaluator computes metrics from predictions (model outputs) and returns a scalar metric.
> *Example API:*
> {code:java}
> > lr.eval <- spark_regression_evaluator(){code}
> h2. Miscellaneous Classes
> MLlib pipelines have a variety of miscellaneous classes that serve as helpers and utilities. For example, an object of {{ParamGridBuilder}} is used to build a grid-search pipeline. Another example is {{ClusteringSummary}}.
> *Example API:*
> {code:java}
> > grid <- param_grid_builder() %>%
>     add_grid(reg_param(lr), c(0.1, 0.01)) %>%
>     add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
>     add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> > model <- train_validation_split() %>%
>     set_estimator(lr) %>%
>     set_evaluator(spark_regression_evaluator()) %>%
>     set_estimator_param_maps(grid) %>%
>     set_train_ratio(0.8) %>%
>     set_parallelism(2) %>%
>     fit(){code}
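> Putting the sections above together, a minimal end-to-end sketch in the proposed API might look as follows. This is illustrative only: {{spark_hashing_tf()}} is extrapolated from the stated naming convention rather than taken from the proposal, and {{training}} and {{test}} are assumed to be existing SparkDataFrames.
> {code:java}
> > tokenizer <- spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words")
> > hashing_tf <- spark_hashing_tf() %>% set_input_col("words") %>% set_output_col("features")
> > lr <- spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.001)
> > pipeline <- spark_pipeline() %>% set_stages(tokenizer, hashing_tf, lr)
> > model <- pipeline %>% fit(training)
> > predictions <- model %>% transform(test){code}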
> h2. Pipeline Persistence
> The SparkML package will fix a longstanding issue with SparkR model persistence, SPARK-15572: SparkML will directly wrap MLlib's pipeline persistence API.
> *API example:*
> {code:java}
> > model <- pipeline %>% fit(training)
> > model %>% spark_write_pipeline(overwrite = TRUE, path = "..."){code}
> h1. Design Sketch
> We propose using code generation from Scala to produce comprehensive API wrappers in R. For more details, please see the attached design document.
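> To make the code-generation idea concrete, here is a hand-written sketch of what one generated wrapper could look like. It is not taken from the attached design document: it assumes the generated code delegates to the JVM through SparkR's exported {{sparkR.newJObject}} and {{sparkR.callJMethod}} helpers, and it uses a plain list where the real package would define S4 classes.
> {code:java}
> library(SparkR)  # assumes an active session, i.e. sparkR.session() was called
>
> # Hypothetical generator output: one constructor per MLlib class...
> spark_logistic_regression <- function() {
>   jobj <- sparkR.newJObject("org.apache.spark.ml.classification.LogisticRegression")
>   structure(list(jobj = jobj), class = "spark_estimator")
> }
>
> # ...and one snake_case setter per Param. MLlib setters return `this`,
> # so the wrapped JVM reference can simply be threaded through.
> set_max_iter <- function(estimator, value) {
>   estimator$jobj <- sparkR.callJMethod(estimator$jobj, "setMaxIter", as.integer(value))
>   estimator
> }{code}
> Because each MLlib Param carries its name and type on the JVM side, a generator can emit such setters mechanically, which is what keeps coverage of future MLlib APIs cheap to maintain.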