hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
Date Wed, 11 Feb 2009 20:06:59 GMT
Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
----------------------------------------------------------------------------------------------------

                 Key: PIG-665
                 URL: https://issues.apache.org/jira/browse/PIG-665
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Pradeep Kamath
            Assignee: Pradeep Kamath
             Fix For: types_branch


KeyTypeDiscoveryVisitor visits the map plan to determine the data type of the map key. This
is required so that when the map key is null, we can still construct a valid NullableXXXWritable
object to pass to Hadoop in the collect() call (Hadoop needs a valid object even when the key
is null). Currently, KeyTypeDiscoveryVisitor looks only at POPackage and POLocalRearrange
to determine the key type. In a Pig script that results in multiple MapReduce jobs, one
of the jobs can have a map plan containing only POLoads. In that case the map key type
is not discovered, so the HDataType.getWritableComparableTypes() method returns null,
which in turn causes a NullPointerException in collect().
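
The failure mode described above can be modeled with a small, self-contained Java sketch. It does not use Pig's actual classes: OpKind, KeyType, Op, discoverKeyType and nullKeyWritableFor are hypothetical stand-ins, and the returned strings are only labels. It assumes nothing beyond what is stated above, namely that discovery only inspects rearrange/package operators and that an undiscovered type leads to a null lookup result.

```java
import java.util.Arrays;
import java.util.List;

public class KeyTypeSketch {
    // Hypothetical stand-ins for physical operator kinds (not Pig's real classes).
    enum OpKind { LOAD, LOCAL_REARRANGE, PACKAGE }

    // Hypothetical type codes; UNKNOWN models "no key type was discovered".
    enum KeyType { UNKNOWN, INTEGER, TUPLE }

    // A minimal operator: its kind plus the key type it would report, if any.
    static final class Op {
        final OpKind kind;
        final KeyType keyType;

        Op(OpKind kind, KeyType keyType) {
            this.kind = kind;
            this.keyType = keyType;
        }
    }

    // Models the described behavior: only rearrange/package operators
    // contribute a key type, so a plan made only of loads yields UNKNOWN.
    static KeyType discoverKeyType(List<Op> mapPlan) {
        for (Op op : mapPlan) {
            if (op.kind == OpKind.LOCAL_REARRANGE || op.kind == OpKind.PACKAGE) {
                return op.keyType;
            }
        }
        return KeyType.UNKNOWN;
    }

    // Models the lookup of a placeholder object for a null key.
    // Returns null for UNKNOWN, which is what a later dereference trips over.
    static String nullKeyWritableFor(KeyType type) {
        switch (type) {
            case INTEGER:
                return "NullableIntWritable"; // label only, for illustration
            case TUPLE:
                return "NullableTuple";       // label only, for illustration
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        // Map plan with a rearrange: the key type is found.
        List<Op> withRearrange = Arrays.asList(
                new Op(OpKind.LOAD, KeyType.UNKNOWN),
                new Op(OpKind.LOCAL_REARRANGE, KeyType.INTEGER));
        // Map plan with only loads, as in the join job described above.
        List<Op> loadsOnly = Arrays.asList(
                new Op(OpKind.LOAD, KeyType.UNKNOWN),
                new Op(OpKind.LOAD, KeyType.UNKNOWN));

        System.out.println(nullKeyWritableFor(discoverKeyType(withRearrange)));
        System.out.println(nullKeyWritableFor(discoverKeyType(loadsOnly)));
    }
}
```

In this model, any caller that dereferences the result for the loads-only plan would hit a NullPointerException, which corresponds to the failure reported in collect().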

Here is a script that reproduces this behavior:
{code}
a = load 'a.txt' as (x:int, y:int, z:int);
b = load 'b.txt' as (x:int, y:int);
b_group = group b by x;
b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
a_group = group a by (x, y);
a_aggs = foreach a_group {
            generate 
                flatten(group) as (x, y),
                SUM(a.z) as zs;
                };
-- The map plan for this join will only have two POLoads, which will result in
-- the NullPointerException at runtime in collect()
join_a_b = join b_sum by x, a_aggs by x;
dump join_a_b;

{code} 

Contents of a.txt (columns are tab separated):
The first column of the first two rows is null (represented by an empty column)
{noformat}
        7       8
        8       9
1       20      30
1       20      40
{noformat}

Contents of b.txt (columns are tab separated):
{noformat}
7       2
1       5
1       10
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

