spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-21094) Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
Date Sat, 17 Jun 2017 23:49:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21094:
------------------------------------

    Assignee: Apache Spark

> Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
> ----------------------------------------------------------------
>
>                 Key: SPARK-21094
>                 URL: https://issues.apache.org/jira/browse/SPARK-21094
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.1.1
>            Reporter: Peter Parente
>            Assignee: Apache Spark
>
> The Popen call to launch the py4j gateway specifies no stdout and stderr options, meaning
> logging from the JVM always goes to the parent process terminal.
> https://github.com/apache/spark/blob/v2.1.1/python/pyspark/java_gateway.py#L77
> It would be super handy if the launch_gateway function took an additional dict parameter
> called popen_kwargs and passed it along to the Popen calls. This API enhancement, for example,
> would allow Python applications to capture all stdout and stderr coming from Spark and process
> it programmatically, without resorting to reading from log files or other hijinks.
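
The forwarding idea can be sketched without Spark at all. The snippet below is only an illustration under stated assumptions: `launch_child` is a hypothetical stand-in for launch_gateway, a small Python child process stands in for the JVM, and keeping stdin under the launcher's control is an assumption of this sketch, not a statement about the actual Spark code.

```python
import subprocess
import sys


def launch_child(command, popen_kwargs=None):
    """Hypothetical stand-in for launch_gateway: start `command`,
    forwarding any caller-supplied options to subprocess.Popen."""
    # Copy so the caller's dict is never mutated.
    kwargs = {} if popen_kwargs is None else dict(popen_kwargs)
    # Assumption for this sketch: the launcher keeps control of stdin,
    # everything else is left to the caller.
    kwargs["stdin"] = subprocess.PIPE
    return subprocess.Popen(command, **kwargs)


# A tiny Python child stands in for the JVM here.
child_cmd = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]

proc = launch_child(child_cmd, popen_kwargs={
    "stdout": subprocess.PIPE,
    "stderr": subprocess.PIPE,
    "bufsize": 0,
})

# communicate() closes stdin, drains both pipes, and waits for exit.
out, err = proc.communicate()
```

With no popen_kwargs the child inherits the parent's stdout and stderr, which matches the behavior described above; with the pipes requested, the caller receives the output as bytes.
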
> Example use:
> {code}
> import pyspark
> import subprocess
> from pyspark.java_gateway import launch_gateway
> # Make the py4j JVM stdout and stderr available without buffering
> popen_kwargs = {
>   'stdout': subprocess.PIPE,
>   'stderr': subprocess.PIPE,
>   'bufsize': 0
> }
> # Launch the gateway with our custom settings
> gateway = launch_gateway(popen_kwargs=popen_kwargs)
> # Use the gateway we launched
> sc = pyspark.SparkContext(gateway=gateway)
> # This could be done in a thread or event loop or ...
> # Written briefly / poorly here only as a demo
> # readline returns as soon as a line is available; b'' signals EOF
> for line in iter(gateway.proc.stdout.readline, b''):
>   print(line.decode('utf-8').rstrip())
> {code}
> To get access to the stdout and stderr pipes, the "proc" instance created in launch_gateway
> also needs to be exposed to the application. I think stashing it on the JavaGateway
> instance that the function already returns is the cleanest option from the client perspective,
> but it means hanging an extra attribute off the py4j.JavaGateway object.
> I can submit a PR with this addition for further discussion.
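
As a rough illustration of the "thread" option mentioned in the example, the sketch below drains both pipes of a child process on background threads. It makes several assumptions: a short-lived Python child stands in for `gateway.proc`, and `drain`, `out_lines`, and `err_lines` are hypothetical names introduced only for this sketch.

```python
import subprocess
import sys
import threading


def drain(pipe, sink):
    """Read `pipe` line by line until EOF, appending decoded lines to `sink`."""
    for raw in iter(pipe.readline, b""):
        sink.append(raw.decode("utf-8").rstrip("\r\n"))
    pipe.close()


# Stand-in for gateway.proc: a child that writes to both streams.
proc = subprocess.Popen(
    [
        sys.executable,
        "-c",
        "import sys; print('out 1'); "
        "print('err 1', file=sys.stderr); print('out 2')",
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    bufsize=0,
)

out_lines, err_lines = [], []
threads = [
    threading.Thread(target=drain, args=(proc.stdout, out_lines), daemon=True),
    threading.Thread(target=drain, args=(proc.stderr, err_lines), daemon=True),
]
for t in threads:
    t.start()

# The main thread stays free for other work; here it just waits.
proc.wait()
for t in threads:
    t.join()
```

Reading each pipe on its own thread avoids the deadlock that can occur when one pipe fills up while the reader is blocked on the other.
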



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

