pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
Date Tue, 08 Dec 2009 01:24:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787234#action_12787234

Pradeep Kamath commented on PIG-1131:

Not sure, but looking through the code, the issue might be present even if first input tuple
has more fields than subsequent input tuples for either of the inputs in the join. The issue
is because in the optimization to not send parts of the value present in the key, in POLocalRearrange,
we further try to optimize by remembering which parts of the value were present in the key
for the first value - so if the next value has different number of fields, we hit the exception
seen in the description.

> Pig simple join does not work when it contains empty lines
> ----------------------------------------------------------
>                 Key: PIG-1131
>                 URL: https://issues.apache.org/jira/browse/PIG-1131
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.7.0
>         Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in the PRLocalRearrange.java
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> 	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> 	at java.util.ArrayList.get(ArrayList.java:322)
> 	at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add this data
which contains empty lines to the testcases?
> Viraj

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message