pig-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly
Date Wed, 15 Oct 2014 17:17:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172620#comment-14172620 ]

Cheolsoo Park commented on PIG-4227:
------------------------------------

[~daijy], sorry for breaking unit tests.
{quote}
I don't totally understand the issue in the description, is that because jython adds tuple
inside a list automatically but python does not?
{quote}
You're right that a Jython udf usually doesn't return a list of Python tuples but just a list of Python objects. In that case, Pig automatically converts it to a bag of tuples by wrapping each object in a tuple. However, the streaming Python udf serializes it as a bag of non-tuples, and the objects are never wrapped in tuples. The problem is that outputSchema is defined as something like {{bag:\{tuple\:( chararray )\}}}, so the deserialization code skips bytes for tuple delimiters that do not exist. That results in truncating 3 chars at the beginning and at the end.
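To make the difference concrete, here is a minimal sketch (a hypothetical udf body, not one from this ticket) of the two return styles, assuming an outputSchema of bag\{tuple(chararray)\}:
{code}
# Hypothetical udf bodies; the outputSchema is assumed to be bag{tuple(chararray)}.
def to_bag(names):
    # Returning bare strings: the Jython engine wraps each element in a tuple,
    # so the result is a proper bag of tuples. Streaming Python serializes the
    # strings without tuple delimiters, so the deserializer skips bytes that
    # are not there and truncates the values.
    return [n for n in names]

def to_bag_portable(names):
    # Returning one-element tuples produces the same bag in both modes.
    return [tuple([n]) for n in names]
{code}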

So the root cause is that Jython and streaming Python handle a Python list of non-tuples differently, which makes it impossible to run the same udf in the two modes. With my patch, I can run the same udf in both modes and get the same result. For example, here is the diff in one of the udfs before and after my patch. This should clarify the difference-
{code}
34c34
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
44c44
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
49c49
<                     output.append(items[i]['id'])
---
>                     output.append(tuple([items[i]['id']]))
84c84
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
96c96
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
101c101
<                     output.append(items[i]['id'])
---
>                     output.append(tuple([items[i]['id']]))
105c105
<                 return [-1]
---
>                 return [tuple([-1])]
{code}
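
For completeness, here is a self-contained sketch of the pattern the patch moves to. The function and field names are made up, and the schema string and pig_util import are just the usual streaming-udf boilerplate as I remember it, so treat them as assumptions rather than part of the patch:
{code}
# Sketch of a streaming Python udf that returns a bag portably. Hypothetical
# names; the schema string and pig_util import are assumptions, not from the patch.
from pig_util import outputSchema

@outputSchema("recos:bag{t:tuple(id:chararray)}")
def top_ids(ids):
    output = []
    for i in ids:
        # wrap each scalar in a one-element tuple instead of appending it bare,
        # so the serialized bag contains the tuple delimiters the schema expects
        output.append(tuple([i]))
    return output
{code}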

> Streaming Python UDF handles bag outputs incorrectly
> ----------------------------------------------------
>
>                 Key: PIG-4227
>                 URL: https://issues.apache.org/jira/browse/PIG-4227
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when run as jython and as streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap the bag delimiters '\{' and '\}', i.e. '\{' => '|\{\_' and '\}' => '|\}\_'.
> But this is wrong because a bag must contain tuples, not chararrays, i.e. the correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.
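
To illustrate the encoding described above, here is a minimal sketch of the delimiter wrapping (not Pig's actual serializer; the separator between bag items is an assumption):
{code}
# Minimal sketch (not Pig's serializer) of the delimiter wrapping described
# in the issue; the comma between bag items is an assumption.
BAG_START, BAG_END = '|{_', '|}_'
TUPLE_START, TUPLE_END = '|(_', '|)_'

def serialize_bag_wrong(values):
    # bag of bare chararrays, no tuple delimiters (what streaming python did)
    return BAG_START + ','.join(values) + BAG_END

def serialize_bag_correct(values):
    # every bag item wrapped as a tuple, matching bag{tuple(chararray)}
    return BAG_START + ','.join(TUPLE_START + v + TUPLE_END for v in values) + BAG_END

print(serialize_bag_wrong(['[[BBC Worldwide]]']))    # |{_[[BBC Worldwide]]|}_
print(serialize_bag_correct(['[[BBC Worldwide]]']))  # |{_|(_[[BBC Worldwide]]|)_|}_
{code}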



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
