hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema
Date Fri, 19 Feb 2010 20:46:28 GMT

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835944#action_12835944 ]

Richard Ding commented on PIG-1188:

To summarize where we are:

Right now the Pig project operator pads a null if the value to be projected doesn't exist.
As a consequence, the desired result is achieved if PigStorage is used and a schema with data
types is specified, since in this case Pig inserts a project+cast operator for each field
in the schema.

In the case where no schema is specified in the load statement, Pig does a good job of adhering
to Pig's philosophy and lets the program run without throwing a runtime exception.

That leaves the case where a schema is specified without data types. There are several options:

   * Pig automatically inserts a project operator for each field in the schema to ensure the
input data matches the schema. The trade-off is a performance penalty. Is it worthwhile
if most user data is well-behaved?

   * Users can explicitly add a foreach statement after the load statement that projects
all the fields in the schema. This is similar to the common practice of running a map job
first to clean up the data.

   * Pig can also delegate the padding work to the loaders. The problem is that the schema
currently isn't passed to the loaders.
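
For illustration, the second option could look like the sketch below. It reuses the script and
input file from the issue description; the alias b is hypothetical. Assuming the project
operator pads nulls as described above, the short row is expected to come back with a null
for a1 and the long row to be trimmed to the two projected fields. This has not been checked
against a particular Pig release.

{code}
-- load with a schema that names the fields but gives no data types
a = load '1.txt' as (a0, a1);
-- explicitly project every field in the schema; a missing a1 is expected to become null
b = foreach a generate a0, a1;
dump b;
{code}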

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
> Currently, the number of fields in the input tuple is determined by the data. When we
> have a schema, we should generate input data according to the schema, padding nulls if
> necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

