hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema
Date Wed, 03 Feb 2010 02:06:19 GMT

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828896#action_12828896 ]

Alan Gates commented on PIG-1188:

After further thought I want to change my position on this.

There are two cases to consider: when a schema is present and when it isn't.  The problem is
that by the time Pig tries to access the missing field (in the backend), it has no idea whether
a schema exists or not.  So at runtime, Pig should just return a null if it gets an out-of-bounds
exception such as ArrayIndexOutOfBoundsException.
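
As a rough illustration of the runtime behaviour described above, here is a minimal Java
sketch. It models a tuple as a plain java.util.List rather than Pig's own Tuple class, and
getFieldOrNull is a hypothetical helper name, not a method in Pig.

```java
import java.util.Arrays;
import java.util.List;

class SafeFieldAccess {
    // Hypothetical helper (not part of Pig): return the field at `index`,
    // or null when the tuple is shorter than the position being accessed.
    static Object getFieldOrNull(List<Object> tuple, int index) {
        try {
            return tuple.get(index);
        } catch (IndexOutOfBoundsException e) {
            // Missing field: behave as if the tuple were padded with nulls at the end.
            return null;
        }
    }

    public static void main(String[] args) {
        // Third row of the example input: only one field is present.
        List<Object> shortTuple = Arrays.<Object>asList("1");
        System.out.println(getFieldOrNull(shortTuple, 0)); // prints 1
        System.out.println(getFieldOrNull(shortTuple, 1)); // prints null
    }
}
```

Because the lookup itself handles the short tuple, nothing has to know at this point whether
a schema was declared.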

How to pad missing data should be left up to the load function.  Perhaps certain load functions
do know how to pad missing data, or are ok with the pad-at-the-end scheme proposed here.
If the load function does not check, then Pig would effectively pad at the end, given the
proposal above.  If the load function implementer does not want this to happen, s/he can check
each tuple being read from the input to ensure it matches the schema, and then decide to pad
the tuple with nulls, reject the tuple, or return a tuple full of nulls.
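
The three options open to a load function can be sketched as follows. This is only an
illustration under stated assumptions: tuples are modelled as java.util.List, the schema is
reduced to a field count, and the names SchemaConformance, Policy and conform are hypothetical
rather than part of Pig's LoadFunc interface. Dropping extra trailing fields under PAD is also
an assumption, taken from the "Desired result" in the issue quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SchemaConformance {
    // Hypothetical policies a load function might choose between.
    enum Policy { PAD, REJECT, ALL_NULLS }

    // Returns a tuple with exactly `schemaSize` fields, or null if the tuple is rejected.
    static List<Object> conform(List<Object> tuple, int schemaSize, Policy policy) {
        if (tuple.size() == schemaSize) {
            return tuple; // already matches the schema
        }
        if (policy == Policy.REJECT) {
            return null; // caller is expected to skip this tuple
        }
        if (policy == Policy.ALL_NULLS) {
            return new ArrayList<Object>(Collections.<Object>nCopies(schemaSize, null));
        }
        // PAD: keep the leading fields, fill missing trailing fields with null,
        // and drop any fields beyond the schema (assumption, see lead-in).
        List<Object> out = new ArrayList<Object>(schemaSize);
        for (int i = 0; i < schemaSize; i++) {
            out.add(i < tuple.size() ? tuple.get(i) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // The three input rows from the issue, checked against a two-field schema.
        System.out.println(conform(Arrays.<Object>asList("1", "2"), 2, Policy.PAD));      // [1, 2]
        System.out.println(conform(Arrays.<Object>asList("1", "2", "3"), 2, Policy.PAD)); // [1, 2]
        System.out.println(conform(Arrays.<Object>asList("1"), 2, Policy.PAD));           // [1, null]
    }
}
```

Note that the size comparison runs once per tuple, which is the per-record cost weighed
against PigStorage in the next paragraph.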

In the case of PigStorage, checking each tuple for a match against the schema is too expensive.
Ideally I would like it to check, because I think that when the user gives a schema it's an error
if the data doesn't match.  But I don't want to pay the performance penalty in this case.

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>             Fix For: 0.7.0
> Currently, the number of fields in the input tuple is determined by the data. When we
> have a schema, we should generate input data according to the schema, padding with nulls if necessary.
> Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}
