pig-dev mailing list archives

From "Adam Szita (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-5177) Scripting and StreamingPythonUDFs fail with Spark exec type
Date Fri, 10 Mar 2017 10:23:04 GMT

    [ https://issues.apache.org/jira/browse/PIG-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904852#comment-15904852 ]

Adam Szita commented on PIG-5177:
---------------------------------

[~kellyzly]
Basically the issue is caused by the backend not being able to find the script file (e.g. in _ScriptEngine#getScriptAsStream_).

1. This is only an issue in yarn-client mode. In local mode it works because the script file
is available in the local FS at its original location.

2. Script files have to be shipped to the (backend) executor nodes. This is done differently
in MR/Tez mode and in Spark mode.
In all cases the script file paths are available from pigContext().getScriptFiles() (after they
were registered on the frontend). In MR/Tez modes, _JarManager#createPigScriptUDFJar(PigContext)_
creates a jar file and puts the script files into it. This jar is distributed to the
backend nodes, and at job execution the scripts are read through a ClassLoader (e.g. here:
https://github.com/apache/pig/blob/spark/src/org/apache/pig/scripting/ScriptEngine.java#L146).
In Spark mode we use _LoadConverter#registerUdfFiles_ on the frontend and let Spark do the job
of distributing the script files to the executor nodes. Later, on the backend, an executor can
retrieve the path of the script file using SparkFiles.get(originalFileName). This points to the
file in the executor's container, and we can use it to open a FileInputStream.
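
To illustrate the difference between the two lookups, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, and the resolver parameter only stands in for SparkFiles.get so that the example runs without Spark on the classpath. It also assumes the distributed file is looked up by its base name.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Function;

// Hypothetical illustration class, not part of Pig.
public class ScriptLookupSketch {

    // MR/Tez style: the script was packed into a jar that is on the
    // classpath of the backend, so it is read back as a classpath resource.
    // Returns null if no such resource exists.
    static InputStream openFromClasspath(String scriptPath) {
        return ScriptLookupSketch.class.getClassLoader().getResourceAsStream(scriptPath);
    }

    // Spark style: the file was distributed by the framework, and the
    // executor asks for its local path. The resolver is a stand-in for
    // SparkFiles.get (assumption: lookup by base file name).
    static InputStream openFromDistributedFile(String scriptPath,
                                               Function<String, String> resolver)
            throws IOException {
        String fileName = new File(scriptPath).getName();
        String localPath = resolver.apply(fileName);
        return new FileInputStream(localPath);
    }
}
```

On a real executor the resolver would be something like SparkFiles::get, and the stream would then be read from the copy in the executor's container rather than from the script's original location.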

This patch fixes about 30 E2E test case failures, since this problem is common to all of the
scripting features.

> Scripting and StreamingPythonUDFs fail with Spark exec type
> -----------------------------------------------------------
>
>                 Key: PIG-5177
>                 URL: https://issues.apache.org/jira/browse/PIG-5177
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5177.0.patch, PIG-5177.1.patch, PIG-5177.2.patch
>
>
> An exception is thrown because the Python script file is not found on the backend
> side (on Spark executors).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
