spark-issues mailing list archives

From "Lucas Partridge (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
Date Mon, 20 Aug 2018 13:03:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351 ]

Lucas Partridge edited comment on SPARK-19498 at 8/20/18 1:02 PM:
------------------------------------------------------------------

Ok great. Here's my feedback after wrapping a large, complex Python algorithm for ML Pipelines on Spark 2.2.0. Several of these comments probably apply beyond pyspark too. (Minimal illustrative sketches for several of these points follow the list.)
 # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours, so we need to be able to save and reload models. Pending the availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a refinement of [https://stackoverflow.com/a/49195515/1843329] to work around this (see the first sketch after this list). Had this not been solved, no further work would have been done.


 # Support for saving/loading more param types (e.g., dict) would be great. I had to use json.dumps to convert our algorithm's internal model into a string and then pretend it was a string param in order to save and load it with the rest of the transformer (second sketch below).


 # Given that pipelinemodels can be saved, we also need the ability to export them easily for deployment on other clusters. The cluster where you train the model may be different from the one where you deploy it for predictions. A hack workaround is to use hdfs commands to copy the relevant files and directories, but it would be great to have simple, single export/import commands in pyspark to move models/pipelines/pipelinemodels easily between clusters and to allow artifacts to be stored off-cluster (third sketch below).


 # Creating individual parameters with getters and setters is tedious and error-prone, especially if writing docs inline too. It would be great if as much of this boilerplate as possible could be auto-generated from a simple parameter definition (the fourth sketch below shows the pattern that has to be repeated for every param). I always groan when someone asks for an extra param at the moment!


 # The ML Pipeline API seems to assume all the params lie on the estimator and none on the transformer. In the algorithm I wrapped, the model/transformer has numerous params that are specific to it rather than to the estimator. PipelineModel needs a getStages() method (just as Pipeline has) to get at the model so you can parameterise it; I had to use the undocumented .stages member instead. But then, if you want to call transform() on a pipelinemodel immediately after fitting it, you also need some ability to set the model/transformer params in advance. I got around this by defining one params class for the estimator-only params and another for the model-only params. The estimator inherits from both classes and the model inherits from only the model-only params class. The estimator then passes any model-specific params through to the model when it creates it at the end of its fit() method. But, to distinguish the model-only params from the estimator-only ones (e.g., when listing the params on the estimator), I had to prefix all the model-only params with a common value to identify them. This works, but it's clunky and ugly (fifth sketch below).


 # The algorithm I ported works naturally with individually named column inputs, but the existing ML Pipeline library prefers DenseVectors. I ended up having to support both types of input: if the DenseVector input was None, I would take the data directly from the individually named columns instead. If users want to use the algorithm by itself they can use the column-based input approach; if they want to work with algorithms from the built-in library (e.g., StandardScaler, Binarizer, etc.) they can use the DenseVector approach instead. Again this works, but it's clunky because you're handling two different forms of input inside the same implementation (sixth sketch below). Also, DenseVectors are limited by their inability to handle missing values.


 # Similarly, I wanted to produce multiple separate columns for the outputs of the model's transform() method, whereas most built-in algorithms seem to use a single DenseVector output column. DataFrame's withColumn() method could do with a withColumns() equivalent to make it easy to add multiple columns to a DataFrame instead of just one column at a time (last sketch below).


 # Documentation explaining how to create a custom estimator and transformer (preferably one
with transformer-specific params) would be extremely useful for people. Most of what I learned
I gleaned off StackOverflow and from looking at Spark's pipeline code.

Hope this list will be useful for improving ML Pipelines in future versions of Spark!
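
For concreteness, here are some minimal sketches of the points above; every class, column, and parameter name in them is invented for illustration. First, persistence (point 1): this is not the exact Stack Overflow workaround I used, but it shows roughly what saving and loading a custom transformer looks like once the DefaultParamsReadable/DefaultParamsWritable mixins from SPARK-17025 are available (pyspark 2.3+):

{code:python}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class PlusOneTransformer(Transformer, HasInputCol, HasOutputCol,
                         DefaultParamsReadable, DefaultParamsWritable):
    """Toy transformer; the two mixins give it write()/save() and load()."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(PlusOneTransformer, self).__init__()
        # only the kwargs actually passed end up in _input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.col(self.getInputCol()) + 1)


# t = PlusOneTransformer(inputCol="x", outputCol="x_plus_1")
# t.write().overwrite().save("/tmp/plus_one")
# t2 = PlusOneTransformer.load("/tmp/plus_one")
{code}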
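
Second (point 2), the json.dumps trick for a dict-valued "param"; the param and mixin names are made up:

{code:python}
import json

from pyspark.ml.param import Param, Params, TypeConverters


class HasInternalModel(Params):
    """Mixin that smuggles a dict through persistence as a string Param."""

    internalModelJson = Param(Params._dummy(), "internalModelJson",
                              "JSON-encoded internal model (really a dict)",
                              typeConverter=TypeConverters.toString)

    def setInternalModel(self, model_dict):
        # store the dict as a JSON string so the normal string-param
        # save/load machinery can handle it
        return self._set(internalModelJson=json.dumps(model_dict))

    def getInternalModel(self):
        return json.loads(self.getOrDefault(self.internalModelJson))
{code}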
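
Third (point 3), the manual export/import dance; the paths are made up, and the actual copy between clusters happens outside pyspark (e.g. with hadoop distcp):

{code:python}
from pyspark.ml import PipelineModel


def export_model(fitted_model, path):
    """Save a fitted PipelineModel somewhere another cluster can reach."""
    fitted_model.write().overwrite().save(path)


def import_model(path):
    """Load a PipelineModel previously exported to path."""
    return PipelineModel.load(path)


# export_model(pipeline_model, "hdfs:///models/my_model_v1")
# ...copy the saved directory between clusters out-of-band...
# restored = import_model("hdfs:///models/my_model_v1")
# predictions = restored.transform(new_data)
{code}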
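
Fourth (point 4), the per-parameter boilerplate that has to be repeated for every param, shown for a single invented one:

{code:python}
from pyspark.ml.param import Param, Params, TypeConverters


class HasSmoothing(Params):
    """One param's worth of boilerplate: declaration, doc, default, getter, setter."""

    smoothing = Param(Params._dummy(), "smoothing",
                      "Smoothing factor applied during training.",
                      typeConverter=TypeConverters.toFloat)

    def __init__(self):
        super(HasSmoothing, self).__init__()
        self._setDefault(smoothing=1.0)

    def getSmoothing(self):
        return self.getOrDefault(self.smoothing)

    def setSmoothing(self, value):
        return self._set(smoothing=value)
{code}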
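
Fifth (point 5), a compressed version of the estimator-only/model-only params split described above; the real algorithm is omitted and all names are invented:

{code:python}
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.sql import functions as F


class _EstimatorOnlyParams(Params):
    maxIter = Param(Params._dummy(), "maxIter", "training iterations",
                    typeConverter=TypeConverters.toInt)


class _ModelOnlyParams(Params):
    # the "model" prefix is the common value used to spot model-only params
    # when they are listed on the estimator
    modelThreshold = Param(Params._dummy(), "modelThreshold",
                           "threshold used by the fitted model's transform()",
                           typeConverter=TypeConverters.toFloat)


class MyEstimator(Estimator, _EstimatorOnlyParams, _ModelOnlyParams):
    def __init__(self):
        super(MyEstimator, self).__init__()
        self._setDefault(maxIter=10, modelThreshold=0.5)

    def _fit(self, dataset):
        model = MyModel()
        # pass the model-only params through to the freshly created model
        model._set(modelThreshold=self.getOrDefault(self.modelThreshold))
        return model


class MyModel(Model, _ModelOnlyParams):
    def __init__(self):
        super(MyModel, self).__init__()
        self._setDefault(modelThreshold=0.5)

    def _transform(self, dataset):
        t = self.getOrDefault(self.modelThreshold)
        return dataset.withColumn("prediction",
                                  (F.col("score") > t).cast("double"))
{code}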
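
Sixth (point 6), the shape of the dual-input handling: use the named columns unless a vector column has been set, in which case unpack it first. The column names, params, and toy "score" logic are all invented:

{code:python}
import functools

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType


class DualInputTransformer(Transformer):
    featuresCol = Param(Params._dummy(), "featuresCol",
                        "optional DenseVector input column",
                        typeConverter=TypeConverters.toString)
    valueCols = Param(Params._dummy(), "valueCols",
                      "individually named input columns",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, featuresCol=None, valueCols=None):
        super(DualInputTransformer, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        cols = self.getOrDefault(self.valueCols)
        if self.isSet(self.featuresCol):
            # vector route (interoperates with VectorAssembler and friends):
            # unpack the DenseVector into the named columns, then fall through
            unpack = F.udf(lambda v: v.toArray().tolist(),
                           ArrayType(DoubleType()))
            arr = unpack(F.col(self.getOrDefault(self.featuresCol)))
            for i, c in enumerate(cols):
                dataset = dataset.withColumn(c, arr[i])
        # column route: the algorithm's natural interface, and the only one
        # that can carry nulls / missing values
        total = functools.reduce(lambda a, b: a + b, [F.col(c) for c in cols])
        return dataset.withColumn("score", total)
{code}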
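
And finally (point 7), the multi-column output workaround: a tiny helper that fakes a withColumns() with a single select(). (Newer Spark releases have since added a DataFrame.withColumns, but nothing like it existed on 2.x.)

{code:python}
from pyspark.sql import functions as F


def with_columns(df, col_map):
    """Add several columns at once; col_map maps new column names to Column exprs."""
    return df.select("*", *[expr.alias(name) for name, expr in col_map.items()])


# output = with_columns(predictions, {
#     "pred_mean": F.col("raw_prediction") * 2.0,
#     "pred_stddev": F.lit(1.0),
# })
{code}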



> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
>                 Key: SPARK-19498
>                 URL: https://issues.apache.org/jira/browse/SPARK-19498
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we can make
MLlib DataFrame-based APIs more extensible, especially for the purpose of writing 3rd-party
libraries with APIs extended from the MLlib APIs (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes before being
made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or extensive enough?
> The easy answer is to make everything public, but that would be terrible of course in
the long-term.  Let's discuss what is needed and how we can present stable, sufficient, and
easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

