spark-issues mailing list archives

From "holdenk (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-23842) accessing java from PySpark lambda functions
Date Thu, 26 Apr 2018 17:25:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

holdenk resolved SPARK-23842.
-----------------------------
    Resolution: Won't Fix

Not supported by the current design; alternatives do exist, though.
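
For context, one commonly suggested alternative is to wrap the third-party call in a small Java UDF class shipped with {{--jars}} and register it from PySpark, so the call runs inside each executor's JVM and py4j never has to cross into the lambda. A minimal sketch, assuming a hypothetical wrapper class {{com.example.ScoreUDF}} that implements {{org.apache.spark.sql.api.java.UDF1}} ({{spark.udf.registerJavaFunction}} is available from Spark 2.3; older releases expose it on {{sqlContext}}):

{code:python}
# Sketch only: com.example.ScoreUDF is a hypothetical Java wrapper around the
# third-party static method, shipped to the cluster with --jars.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Register the Java class (it must implement org.apache.spark.sql.api.java.UDF1)
# as a SQL function callable from PySpark.
spark.udf.registerJavaFunction("score", "com.example.ScoreUDF", DoubleType())

df = spark.createDataFrame([(1.0,), (2.5,)], ["x"])
df.selectExpr("score(x) AS scored").show()
{code}

The Python closure never references a Java object, so nothing non-serializable ends up in it.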

> accessing java from PySpark lambda functions
> --------------------------------------------
>
>                 Key: SPARK-23842
>                 URL: https://issues.apache.org/jira/browse/SPARK-23842
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>              Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be more of a Spark issue than a py4j one.
> We have a third-party Java library that is distributed to Spark executors through the {{--jars}} parameter.
> We want to call a static Java method from that library on the executor side through Spark's {{map()}}, or create an object of that library's class through a {{mapPartitions()}} call.
> None of these approaches has worked so far. It seems Spark tries to serialize everything it sees in a lambda function and distribute it to the executors.
> I am aware of an older py4j issue/question [#171|https://github.com/bartdag/py4j/issues/171], but that discussion isn't helpful here.
> We thought we could create a reference to that "class" through a call like {{genmodel = spark._jvm.hex.genmodel}} and then operate through py4j to expose that library's functionality in PySpark executors' lambda functions.
> We don't want Spark to try to serialize the Spark session variable {{spark}} or its reference to the py4j gateway {{spark._jvm}} (that leads to the expected non-serializable exceptions), so we tried to "trick" Spark into not serializing them by nesting the above {{genmodel = spark._jvm.hex.genmodel}} inside an {{exec()}} call.
> That led to another issue: neither the {{spark}} (Spark session) nor the {{sc}} (Spark context) variable seems to be available in executors' lambda functions. So we're stuck and don't know how to call a generic Java class through py4j on the executor side (from within {{map}} or {{mapPartitions}} lambda functions).
> This would be an easier adventure from Scala or Java, since those can call that third-party library's methods directly, but our users ask for a way to do the same from PySpark.
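
To make the boundary described above concrete: the py4j gateway behind {{spark._jvm}} exists only in the driver JVM, so driver-side calls work, but the same objects cannot be pickled into a {{map()}} or {{mapPartitions()}} lambda. A minimal, illustrative sketch:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Driver-side py4j access is fine: the gateway to the driver JVM exists here.
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.getProperty("java.version"))

# Executor-side access is not: the lambda below would have to capture the
# py4j gateway, which is neither picklable nor present in the executor
# Python workers, hence the serialization errors described above.
# spark.sparkContext.parallelize([1, 2]).map(lambda x: jvm.java.lang.Math.sqrt(x)).collect()
{code}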



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

