Date: Sat, 2 Jun 2018 03:08:00 +0000 (UTC)
From: "Hossein Falaki (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

    [ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498842#comment-16498842 ]

Hossein Falaki commented on SPARK-24359:
----------------------------------------

Yes, my bad: I meant releasing an update to CRAN for every 2.x and 3.x release. However, if Spark does patch releases like 2.3.4, we are not required to push a new CRAN package, though each one is an opportunity to do so. I guess that is identical to the SparkR CRAN release cycle.

> SPIP: ML Pipelines in R
> -----------------------
>
>                 Key: SPARK-24359
>                 URL: https://issues.apache.org/jira/browse/SPARK-24359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.0.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose through SparkR's simple formula-based API.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR's APIs to expose SparkML's pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions, many of which shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR would degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-generation approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr's API is manually written.
> h1. Target Personas
> * Existing SparkR users who need a more flexible SparkML API
> * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R
> h1. Goals
> * R users can install SparkML from CRAN
> * R users will be able to import SparkML independently of SparkR
> * After setting up a Spark session, R users can
> ** create a pipeline by chaining individual components and specifying their parameters
> ** tune a pipeline in parallel, taking advantage of Spark
> ** inspect a pipeline's parameters and evaluation metrics
> ** repeatedly apply a pipeline
> * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers
> h1. Non-Goals
> * Adding new algorithms to the SparkML R package that do not exist in Scala
> * Parallelizing existing CRAN packages
> * Changing the existing SparkR ML wrapper API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will decide based on the following list of priorities: the API choice that addresses a higher-priority goal will be chosen.
> # *Comprehensive coverage of the MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out.
> # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
> # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
> # *Interoperability with the rest of Spark's components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala.
> # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments.
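> For instance, as a hypothetical illustration of the named-argument convention (these constructor parameters are assumed for this sketch and are not part of the proposal's examples):
> {code:java}
> > lr <- spark_logistic_regression(max_iter = 10, reg_param = 0.1){code}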
> Setters can also be applied by nesting calls:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates nicely into a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:
> {code:java}
> > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> lr{code}
> h2. Namespace
> All new APIs will live in a new CRAN package named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}.
> h2. Pipeline & PipelineStage
> A pipeline object contains one or more stages.
> {code:java}
> > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
> Here {{stage1}}, {{stage2}}, etc. are S4 objects of type PipelineStage, and {{pipeline}} is an object of type Pipeline.
> h2. Transformers
> A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.
> *Example API:*
> {code:java}
> > tokenizer <- spark_tokenizer() %>%
>     set_input_col("text") %>%
>     set_output_col("words")
> > tokenized.df <- tokenizer %>% transform(df){code}
> h2. Estimators
> An Estimator is an algorithm which can be fit on a SparkDataFrame to produce a Transformer. For example, a learning algorithm is an Estimator which trains on a SparkDataFrame and produces a model.
> *Example API:*
> {code:java}
> > lr <- spark_logistic_regression() %>%
>     set_max_iter(10) %>%
>     set_reg_param(0.001){code}
> h2. Evaluators
> An Evaluator computes metrics from predictions (model outputs) and returns a scalar metric.
> *Example API:*
> {code:java}
> > lr.eval <- spark_regression_evaluator(){code}
> h2. Miscellaneous Classes
> MLlib pipelines have a variety of miscellaneous classes that serve as helpers and utilities. For example, an object of {{ParamGridBuilder}} is used to build a grid-search pipeline. Another example is {{ClusteringSummary}}.
> *Example API:*
> {code:java}
> > grid <- param_grid_builder() %>%
>     add_grid(reg_param(lr), c(0.1, 0.01)) %>%
>     add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
>     add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> > model <- train_validation_split() %>%
>     set_estimator(lr) %>%
>     set_evaluator(spark_regression_evaluator()) %>%
>     set_estimator_param_maps(grid) %>%
>     set_train_ratio(0.8) %>%
>     set_parallelism(2) %>%
>     fit(){code}
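> Putting the sections above together, a minimal end-to-end sketch in the proposed API might look as follows. This is illustrative only: {{spark_hashing_tf()}} is extrapolated from the stated naming convention rather than taken from the proposal, and {{training}} and {{test}} are assumed to be existing SparkDataFrames.
> {code:java}
> > tokenizer <- spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words")
> > hashing_tf <- spark_hashing_tf() %>% set_input_col("words") %>% set_output_col("features")
> > lr <- spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.001)
> > pipeline <- spark_pipeline() %>% set_stages(tokenizer, hashing_tf, lr)
> > model <- pipeline %>% fit(training)
> > predictions <- model %>% transform(test){code}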
> h2. Pipeline Persistence
> The SparkML package will fix a longstanding issue with SparkR model persistence, SPARK-15572: SparkML will directly wrap MLlib's pipeline persistence API.
> *API example:*
> {code:java}
> > model <- pipeline %>% fit(training)
> > model %>% spark_write_pipeline(overwrite = TRUE, path = "..."){code}
> h1. Design Sketch
> We propose using code generation from Scala to produce comprehensive API wrappers in R. For more details, please see the attached design document.
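> To make the code-generation idea concrete, here is a hand-written sketch of what one generated wrapper could look like. It is not taken from the attached design document: it assumes the generated code delegates to the JVM through SparkR's exported {{sparkR.newJObject}} and {{sparkR.callJMethod}} helpers, and it uses a plain list where the real package would define S4 classes.
> {code:java}
> library(SparkR)  # assumes an active session, i.e. sparkR.session() was called
>
> # Hypothetical generator output: one constructor per MLlib class...
> spark_logistic_regression <- function() {
>   jobj <- sparkR.newJObject("org.apache.spark.ml.classification.LogisticRegression")
>   structure(list(jobj = jobj), class = "spark_estimator")
> }
>
> # ...and one snake_case setter per Param. MLlib setters return `this`,
> # so the wrapped JVM reference can simply be threaded through.
> set_max_iter <- function(estimator, value) {
>   estimator$jobj <- sparkR.callJMethod(estimator$jobj, "setMaxIter", as.integer(value))
>   estimator
> }{code}
> Because each MLlib Param carries its name and type on the JVM side, a generator can emit such setters mechanically, which is what keeps coverage of future MLlib APIs cheap to maintain.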