hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
Date Wed, 11 Feb 2009 20:06:59 GMT
Map key type not correctly set (for use when key is null) when map plan does not have localrearrange

                 Key: PIG-665
                 URL: https://issues.apache.org/jira/browse/PIG-665
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Pradeep Kamath
            Assignee: Pradeep Kamath
             Fix For: types_branch

KeyTypeDiscoveryVisitor visits the map plan to determine the data type of the map key. This
is required so that when the map key is null, we can still construct a valid NullableXXXWritable
object to pass to Hadoop in the collect() call (Hadoop needs a valid object even when the key
is null). Currently, KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange
to determine the key type. In a Pig script that results in multiple MapReduce jobs, one
of the jobs can have a map plan containing only POLoads. In that case the map key type
is not discovered, so HDataType.getWritableComparableTypes() returns null, which in turn
causes a NullPointerException in collect().
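
To make the failure path concrete, here is a minimal, self-contained sketch in plain Java. It
is not Pig's code: KeyTypeSketch, OpKind, Op, the type codes, and the method names are all
hypothetical stand-ins, and the wrapper is modeled as a plain string. It only assumes what the
description above states, namely that the key type is taken from rearrange/package operators
and that an undiscovered type leads to a null that is later dereferenced.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: every name here is a hypothetical stand-in, not a real Pig class.
final class KeyTypeSketch {
    enum OpKind { LOAD, LOCAL_REARRANGE, PACKAGE }

    // Placeholder type codes (assumption: these are not Pig's actual DataType values).
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;

    static final class Op {
        final OpKind kind;
        final byte keyType;
        Op(OpKind kind, byte keyType) { this.kind = kind; this.keyType = keyType; }
    }

    // Only rearrange/package operators contribute a key type, as the report describes.
    static byte discoverKeyType(List<Op> mapPlan) {
        for (Op op : mapPlan) {
            if (op.kind == OpKind.LOCAL_REARRANGE || op.kind == OpKind.PACKAGE) {
                return op.keyType;
            }
        }
        return UNKNOWN; // a plan with only loads falls through to here
    }

    // Returns a wrapper name for a known type, or null when the type was never discovered.
    static String wrapperFor(byte keyType) {
        return keyType == INTEGER ? "NullableIntWritable" : null;
    }

    // Stand-in for collect(): it dereferences the wrapper, so a null wrapper throws.
    static int collect(List<Op> mapPlan) {
        String wrapper = wrapperFor(discoverKeyType(mapPlan));
        return wrapper.length();
    }

    public static void main(String[] args) {
        List<Op> loadsOnly = Arrays.asList(new Op(OpKind.LOAD, UNKNOWN),
                                           new Op(OpKind.LOAD, UNKNOWN));
        try {
            collect(loadsOnly);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: key type was never discovered");
        }
    }
}
```

In this sketch, a plan that contains a LOCAL_REARRANGE operator yields a usable wrapper, while
a loads-only plan reaches the dereference with null, which mirrors the reported symptom.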

The following script reproduces the problem:
a = load 'a.txt' as (x:int, y:int, z:int);
b = load 'b.txt' as (x:int, y:int);
b_group = group b by x;
b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
a_group = group a by (x, y);
a_aggs = foreach a_group {
                generate flatten(group) as (x, y),
                SUM(a.z) as zs;
};
-- The map plan for this join will only have two POLoads, which will result
-- in the NullPointerException at runtime in collect().
join_a_b = join b_sum by x, a_aggs by x;
dump join_a_b;


Contents of a.txt (columns are tab separated):
The first column of the first two rows is null (represented by an empty column)
        7       8
        8       9
1       20      30
1       20      40

Contents of b.txt (columns are tab separated):
7       2
1       5
1       10
