spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
Date Tue, 21 May 2019 04:02:19 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20765:
---------------------------------
    Labels: bulk-closed  (was: )

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20765
>                 URL: https://issues.apache.org/jira/browse/SPARK-20765
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: APeng Zhang
>            Priority: Major
>              Labels: bulk-closed
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java() invokes JavaParams._from_java() to create the Python instance of each persisted stage. In JavaParams._from_java(), the Python class name is derived from the Java class name by replacing the string "org.apache.spark" with "pyspark". This works for the ML Transformers and Estimators shipped with PySpark, but for a 3rd party Transformer or Estimator whose package name is neither org.apache.spark nor pyspark, loading fails with:
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
> Related code in PySpark:
> In pyspark/ml/pipeline.py
> class Pipeline(Estimator, MLReadable, MLWritable):
>     @classmethod
>     def _from_java(cls, java_stage):
>         # Create a new instance of this stage.
>         py_stage = cls()
>         # Load information from java_stage to the instance.
>         py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> In pyspark/ml/wrapper.py
> class JavaParams(JavaWrapper, Params):
>     @staticmethod
>     def _from_java(java_stage):
>         def __get_class(clazz):
>             """
>             Loads Python class from its name.
>             """
>             parts = clazz.split('.')
>             module = ".".join(parts[:-1])
>             m = __import__(module)
>             for comp in parts[1:]:
>                 m = getattr(m, comp)
>             return m
>         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>         # Generate a default new instance from the stage_name class.
>         py_type = __get_class(stage_name)
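> The failure can be reproduced outside Spark. The sketch below is a standalone illustration, not PySpark's actual code path: derive_python_class_name is a hypothetical helper that mirrors the one-line replace() quoted above. A built-in stage name maps to an importable pyspark module, while a 3rd party class name passes through unchanged, so the subsequent __import__() raises the ImportError from the traceback:

```python
# Standalone reproduction of the name mapping in JavaParams._from_java()
# (pyspark/ml/wrapper.py). The helper name is hypothetical; the
# substitution itself is the one quoted above.

def derive_python_class_name(java_class_name):
    # Only rewrites classes living under org.apache.spark;
    # any other package name is returned unchanged.
    return java_class_name.replace("org.apache.spark", "pyspark")

# A built-in stage maps onto an importable PySpark class name:
print(derive_python_class_name("org.apache.spark.ml.feature.Binarizer"))

# A 3rd party stage passes through untouched, so the later
# __import__(module) in __get_class() cannot succeed:
stage_name = derive_python_class_name("com.abc.xyz.ml.TestClass")
module = ".".join(stage_name.split(".")[:-1])
try:
    __import__(module)
except ImportError as exc:
    print("ImportError:", exc)
```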



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

