hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's
Date Tue, 10 Nov 2009 17:53:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775961#action_12775961 ]

Alan Gates commented on PIG-1065:

As originally defined, UNION does allow two inputs with different schemas; the result
should then have no schema.  So the error here is really that the runtime system doesn't
adapt to the unexpected type.  I'm not sure how useful this functionality is, but it is in
keeping with the original spirit of Pig's "no schema required" approach, so I'm not inclined
to fix it.  At some point in the future we should consider how general we want to be in
these cases, as that generality comes at a significant cost.  But let's decide it for the
language as a whole rather than piecemeal.

I would support the addition of a conforming union (I have no idea what to call it) that
requires each input to have either the same schema or schemas that are somehow compatible.
It would then handle type promotion and add nulls for missing columns.
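
To make the idea concrete, here is a minimal Python sketch of what such a conforming union
might do.  Everything in it is an assumption for illustration: the function names, the
match-by-column-name rule, and the numeric promotion order are hypothetical and are not
Pig's actual semantics.  Values are not cast; only the merged schema records the promotion.

```python
# Hypothetical numeric promotion order, narrowest to widest (not Pig's real rules).
NUMERIC_ORDER = ["int", "long", "float", "double"]

def promote(t1, t2):
    """Return a type wide enough for both inputs, or raise if incompatible."""
    if t1 == t2:
        return t1
    if t1 in NUMERIC_ORDER and t2 in NUMERIC_ORDER:
        return max(t1, t2, key=NUMERIC_ORDER.index)
    raise ValueError(f"incompatible types: {t1} vs {t2}")

def conforming_schema(s1, s2):
    """Merge two schemas, given as lists of (name, type), by column name."""
    merged = dict(s1)
    for name, typ in s2:
        merged[name] = promote(merged[name], typ) if name in merged else typ
    return list(merged.items())

def conform(row, schema, target):
    """Lay a row out in the target schema, using None for missing columns."""
    by_name = {name: value for (name, _), value in zip(schema, row)}
    return tuple(by_name.get(name) for name, _ in target)

def conforming_union(rel1, schema1, rel2, schema2):
    """Union two relations after conforming both to the merged schema."""
    target = conforming_schema(schema1, schema2)
    rows = [conform(r, schema1, target) for r in rel1]
    rows += [conform(r, schema2, target) for r in rel2]
    return target, rows

# The two inputs from the issue: f1 has (key, v), f2 has only (key).
f1_schema = [("key", "chararray"), ("v", "chararray")]
f2_schema = [("key", "chararray")]
schema, rows = conforming_union(
    [("1", "2"), ("2", "3")], f1_schema,
    [("1",), ("2",)], f2_schema,
)
# schema keeps both columns; the f2 rows are padded with None for v.
```

With a known result schema like this, a later ORDER BY on the first column would see one
declared key type for every row rather than a type that depends on which input the row
came from.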

Finally, in response to Santhosh's comment on AS, I think whether we extend AS to statements
beyond LOAD depends on choices we make above about schemas or lack thereof.  But removing
it from the foreach isn't easy.  Consider the following:

B = foreach A generate $0, flatten($1), $2, flatten($3) AS (a:int, b:int, c:float, d:long,

If we don't have a schema for $1 and $3, we don't know whether $2 should end up being c:float
or d:long.
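
The ambiguity can be sketched in a few lines of Python.  This is only an illustration: the
helper name is hypothetical, and the name list contains just the four aliases visible in
the statement above.  Each GENERATE expression contributes some number of output columns
(1 for a plain field, n for a flatten that expands to n fields), and the AS name a field
receives depends on how many columns precede it.

```python
def name_for_expression(widths, target, names):
    """Return the AS name that lands on the first column of expression `target`.

    widths[i] is the number of output columns produced by the i-th GENERATE
    expression; names is the flat list of aliases in the AS clause.
    """
    offset = sum(widths[:target])
    return names[offset]

# Only the aliases visible in the statement above.
names = ["a:int", "b:int", "c:float", "d:long"]

# Expressions: $0, flatten($1), $2, flatten($3); $2 is expression index 2.
one_wide = name_for_expression([1, 1, 1, 1], 2, names)  # flatten($1) yields 1 field
two_wide = name_for_expression([1, 2, 1, 1], 2, names)  # flatten($1) yields 2 fields
print(one_wide, two_wide)  # c:float d:long
```

Without a schema for $1, the width of flatten($1) is unknown until runtime, so the offset
for $2 cannot be computed when the script is parsed.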

> In-determinate behaviour of Union when there are 2 non-matching schema's
> ------------------------------------------------------------------------
>                 Key: PIG-1065
>                 URL: https://issues.apache.org/jira/browse/PIG-1065
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.6.0
> I have a script which first does a union of these schemas and then does an ORDER BY of this result.
> {code}
> f1 = LOAD '1.txt' as (key:chararray, v:chararray);
> f2 = LOAD '2.txt' as (key:chararray);
> u0 = UNION f1, f2;
> describe u0;
> dump u0;
> u1 = ORDER u0 BY $0;
> dump u1;
> {code}
> When I run in Map Reduce mode I get the following result:
> $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
> ====================
> Schema for u0 unknown.
> ====================
> (1,2)
> (2,3)
> (1)
> (2)
> ====================
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1
>         at org.apache.pig.PigServer.openIterator(PigServer.java:475)
>         at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:397)
> ====================
> Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> ====================
> When I run the same script in local mode I get a different result, as we know that local mode does not use any Hadoop classes.
> $java -cp pig.jar org.apache.pig.Main -x local broken.pig
> ====================
> Schema for u0 unknown
> ====================
> (1,2)
> (1)
> (2,3)
> (2)
> ====================
> (1,2)
> (1)
> (2,3)
> (2)
> ====================
> Here are some questions:
> 1) Why do we allow union if the schemas do not match?
> 2) Should we not print an error message/warning so that the user knows that this is not allowed or that he can get unexpected results?
> Viraj

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
