hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
Date Fri, 03 Sep 2010 21:37:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906116#action_12906116 ]

Daniel Dai commented on PIG-1595:
---------------------------------

Patch looks good. This patch addresses the problem that we cannot get the output schema of
the scalar UDF at compile time. Another approach is to write ReadScalars.outputSchema() and
use the input schema to figure out the output schema. But again we would need to address the
dependency to make sure the input schema is correctly set before calling outputSchema(). So
both approaches should be equivalent.
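
For reference, a rough sketch of what that alternative could look like. This is illustrative only, not the attached PIG-1595.1.patch; the class name and the assumption that the scalar column appears as field 0 of the schema passed to outputSchema() are hypothetical:

// Illustrative sketch of the alternative approach: have the scalar UDF report
// an output schema derived from its input schema instead of defaulting to
// bytearray. Not the committed fix.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ReadScalarsWithSchema extends EvalFunc<Object> {

    @Override
    public Schema outputSchema(Schema input) {
        try {
            if (input != null && input.size() > 0) {
                // Reuse the type of the scalar column so downstream operators
                // do not see bytearray when the real type is known.
                return new Schema(new Schema.FieldSchema(null, input.getField(0).type));
            }
        } catch (Exception e) {
            // fall through to the conservative default below
        }
        // Input schema not set yet -- this is exactly the dependency problem
        // mentioned above: outputSchema() only helps if the input schema has
        // been evaluated first.
        return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
    }

    @Override
    public Object exec(Tuple input) throws IOException {
        // The real ReadScalars.exec() reads the single scalar value from a
        // side file; omitted here because only the schema handling matters.
        return null;
    }
}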

> casting relation to scalar- problem with handling of data from non PigStorage loaders
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-1595
>                 URL: https://issues.apache.org/jira/browse/PIG-1595
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1595.1.patch
>
>
> If load functions that don't follow the same bytearray format as PigStorage for the other
> supported datatypes, or that don't implement the LoadCaster interface, are used in 'casting
> relation to scalar' (PIG-1434), the query can fail or produce incorrect results.
> The root cause of the problem is that there is a real dependency between the ReadScalars
> udf that returns the scalar value and the LogicalOperator that acts as its input, but the
> logical plan does not capture this dependency. So in the SchemaResetter visitor used by the
> optimizer, the order in which schemas are reset and re-evaluated does not take this dependency
> into account. If the schema of the input LogicalOperator does not get evaluated before the
> ReadScalars udf, the result type of the ReadScalars udf becomes bytearray. POUserFunc will then
> convert the input to bytearray using 'new DataByteArray(inp.toString().getBytes())'. But this
> bytearray encoding of the other supported types might not match the encoding expected by the
> load function associated with the column, and that can result in problems.
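
To make the encoding mismatch described above concrete, a small standalone illustration; this is not Pig source, and the 8-byte big-endian long encoding stands in for a hypothetical non-PigStorage loader's format, not any real loader:

// Illustration only: why the fallback conversion
// new DataByteArray(inp.toString().getBytes()) can disagree with a loader
// whose LoadCaster does not use PigStorage's text encoding.
import java.nio.ByteBuffer;

import org.apache.pig.data.DataByteArray;

public class BytearrayMismatchDemo {
    public static void main(String[] args) {
        Long scalar = 42L;

        // What POUserFunc produces when the result type has degraded to bytearray:
        DataByteArray textForm =
            new DataByteArray(scalar.toString().getBytes());               // bytes of "42"

        // What the hypothetical binary loader's caster would expect for a long:
        DataByteArray binaryForm =
            new DataByteArray(ByteBuffer.allocate(8).putLong(scalar).array());

        // The two encodings differ, so feeding textForm to that caster's
        // bytesToLong() would yield garbage or an error.
        System.out.println(textForm.equals(binaryForm));                   // prints false
    }
}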

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

