hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
Date Thu, 12 Feb 2009 22:54:59 GMT

     [ https://issues.apache.org/jira/browse/PIG-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-665:
-------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Patch committed.

> Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
> ----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-665
>                 URL: https://issues.apache.org/jira/browse/PIG-665
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-665.patch
>
>
> KeyTypeDiscoveryVisitor visits the map plan to determine the datatype of the map key.
> This is required so that when the map key is null, we can still construct a valid
> NullableXXXWritable object to pass on to Hadoop in the collect() call (Hadoop needs a
> valid object even for null objects). Currently, KeyTypeDiscoveryVisitor looks only at
> POPackage and POLocalRearrange to determine the key type. In a Pig script that results
> in multiple MapReduce jobs, one of the jobs could have a map plan with only POLoads in
> it. In that case the map key type is not discovered, and the
> HDataType.getWritableComparableTypes() method returns null. This in turn results in a
> NullPointerException in collect().
> Here is a script that can trigger this behavior:
> {code}
> a = load 'a.txt' as (x:int, y:int, z:int);
> b = load 'b.txt' as (x:int, y:int);
> b_group = group b by x;
> b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
> a_group = group a by (x, y);
> a_aggs = foreach a_group {
>             generate 
>                 flatten(group) as (x, y),
>                 SUM(a.z) as zs;
>                 };
> join_a_b = join b_sum by x, a_aggs by x; --> the map plan for this join will only have two POLoads which will result in the NullPointerException at runtime in collect()
> dump join_a_b;
> {code} 
> Contents of a.txt (columns are tab separated):
> The first column of the first two rows is null (represented by an empty column)
> {noformat}
>         7       8
>         8       9
> 1       20      30
> 1       20      40
> {noformat}
> Contents of b.txt (columns are tab separated):
> {noformat}
> 7       2
> 1       5
> 1       10
> {noformat}
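
The failure mode described in the issue can be modelled in a few lines of plain Java. This is only an illustrative sketch, not Pig's code: the class `KeyTypeSketch`, the type tags, `NullableKey`, `wrap()` and the local `collect()` are all hypothetical stand-ins for the real key-type constants, the NullableXXXWritable classes, HDataType.getWritableComparableTypes() and Hadoop's collect(). It shows why a known key type lets a null key be wrapped in a valid object, while an undiscovered key type yields a null wrapper that fails when it is dereferenced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class KeyTypeSketch {
    // Hypothetical type tags for illustration; not Pig's DataType constants.
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;
    static final byte TUPLE = 2;

    /** Stand-in for a NullableXXXWritable: a typed wrapper that may hold a null key. */
    static final class NullableKey {
        final byte type;
        final Object value;

        NullableKey(byte type, Object value) {
            this.type = type;
            this.value = value;
        }

        boolean isNull() {
            return value == null;
        }
    }

    // One factory per known key type; UNKNOWN deliberately has no entry.
    private static final Map<Byte, Function<Object, NullableKey>> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put(INTEGER, v -> new NullableKey(INTEGER, v));
        FACTORIES.put(TUPLE, v -> new NullableKey(TUPLE, v));
    }

    /** Returns a wrapper for the key, or null when the key type was never discovered. */
    static NullableKey wrap(Object key, byte keyType) {
        Function<Object, NullableKey> factory = FACTORIES.get(keyType);
        return factory == null ? null : factory.apply(key);
    }

    /** Stand-in for collect(): it dereferences the wrapper, so it needs a non-null object. */
    static String collect(NullableKey key) {
        return key.isNull() ? "null key, type " + key.type : "key " + key.value;
    }

    public static void main(String[] args) {
        // Key type known: a null key still yields a valid wrapper object.
        System.out.println(collect(wrap(null, INTEGER)));
        // Key type undiscovered: wrap() returns null and collect() fails.
        try {
            collect(wrap(null, UNKNOWN));
        } catch (NullPointerException e) {
            System.out.println("NullPointerException in collect()");
        }
    }
}
```

In this model the null rows at the top of a.txt correspond to calling `wrap(null, ...)`: the call is harmless when the key type is known and fatal when it is not.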

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

