hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Semantics of empty bags in Foreach
Date Mon, 10 Nov 2008 21:29:52 GMT
The JIRA https://issues.apache.org/jira/browse/PIG-514 has brought up  
an interesting issue of how we handle empty bags in foreach  
statements.  The current pig semantic for foreach is that it always  
produces a cross product of all of the fields in its projection  
list.  So:

B = foreach A generate $0, $1;

technically produces a cross product of $0 and $1.  Since both $0 and  
$1 are (generally) single valued this produces one row.  In cases  
where they are multi-valued (generate flatten($0), $1) then the cross  
product produces multiple rows.  In cases where any of the elements  
is an empty bag, the cross product produces no row.  That is,  
emptiness is equivalent to a 0 in multiplication: it swallows  
everything else.
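
To make the cross product rule concrete, here is a small Python model of it. This is only a sketch of the semantic described above, not how Pig implements foreach; the function name foreach_generate and the choice to represent every field as a list of values are assumptions made for illustration.

```python
from itertools import product

def foreach_generate(fields):
    """Model the output rows for one input record.

    Each field is given as a list of values: a single-valued field is a
    one-element list, a flattened bag is the list of its elements, and an
    empty bag is an empty list.
    """
    return list(product(*fields))

# Two single-valued fields: exactly one output row.
one_row = foreach_generate([["a"], ["b"]])

# A flattened two-element bag crossed with a single value: two rows.
many_rows = foreach_generate([["x", "y"], ["b"]])

# Any empty bag in the projection: no rows at all.
no_rows = foreach_generate([[], ["b"]])
```

In this model the empty list plays the role of the 0 in multiplication: one empty field is enough to make the whole product empty.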

Because of this, pig is currently implemented such that as soon as it  
sees an empty bag in an output it stops, because there's no point in  
continuing.  So, scripts like:

A = load 'myfile';
B = group A by $0;
C = foreach B {
     D = filter A by $1 > 5;
     E = filter D by $1 < 5;
     generate COUNT(E.$0), group;
};

will generate no output at all.  It would be reasonable to expect that  
the above would produce a list of all entries from the first field of  
'myfile', along with a 0 (for the count).
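
The gap between what happens and what one might expect can be sketched in Python. This is a model of the script above under the semantic described in this message, not Pig's actual code; run_script and the sample rows are hypothetical.

```python
from collections import defaultdict

def run_script(rows):
    """Model the group / nested filter / COUNT script for (key, value) rows.

    Returns two lists: the rows produced when an empty bag stops
    processing, and the rows one might reasonably expect instead.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    current, expected = [], []
    for key, bag in groups.items():
        d = [r for r in bag if r[1] > 5]
        e = [r for r in d if r[1] < 5]  # contradictory filters: always empty
        # Expected: COUNT still runs and reports 0 for every group.
        expected.append((len(e), key))
        # Modeled current behavior: an empty bag ends processing for the
        # group, so COUNT is never evaluated and no row is emitted.
        if e:
            current.append((len(e), key))
    return current, expected

current, expected = run_script([("a", 3), ("a", 7), ("b", 9)])
```

With these sample rows, current is empty while expected holds one (0, key) row per group.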

A couple of questions about this:

1) Should we keep this empty-bag-as-blackhole semantic?  It strikes  
me as reasonable that, instead of acting as a blackhole, an empty bag  
produces a NULL value.  This would make outer joins somewhat easier  
to do.  I'm not sure what other side effects it would have.
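
As a rough illustration of the two options in question 1, the Python sketch below contrasts the blackhole semantic with a NULL-producing one. Both functions are hypothetical models (None stands in for NULL), not proposals for the actual implementation.

```python
from itertools import product

def foreach_blackhole(fields):
    """Current semantic: any empty bag makes the cross product empty."""
    return list(product(*fields))

def foreach_null(fields):
    """Alternative semantic: an empty bag contributes a single NULL."""
    return list(product(*[f if f else [None] for f in fields]))

# A key whose other side has no matching records, as in an outer join.
left_key = ["k"]
unmatched = []

dropped = foreach_blackhole([left_key, unmatched])
kept = foreach_null([left_key, unmatched])
```

Under the NULL variant the unmatched key survives as ("k", None), which is the row an outer join needs; under the blackhole variant it disappears.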

2) If we do keep the blackhole semantic, should UDFs get a chance to  
evaluate an empty bag?  The current implementation certainly seems to  
violate the law of least astonishment.  However, if we extend this to  
UDFs we need to think carefully about what else it needs to be extended  
to.  In particular, the semantic for streaming is that if we have no  
data, we will not invoke the external binary.  It seems we should be  
consistent throughout.  Any empty bag should either mean that we stop  
processing there and return nothing, or that we allow user provided  
code a chance to run, even without input.
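
The two consistent policies in question 2 can be modeled with a single switch. This is a hypothetical Python sketch; apply_user_code and the run_on_empty flag are illustrative names, and user_fn stands in for either a UDF or a streaming binary.

```python
def apply_user_code(bag, user_fn, run_on_empty):
    """Return the list of outputs produced by user code for one bag.

    run_on_empty=False models "stop processing and return nothing";
    run_on_empty=True models "let user code run even without input".
    """
    if not bag and not run_on_empty:
        return []  # user code is never invoked
    return [user_fn(bag)]

# The same empty bag under each policy, with len playing the role of COUNT.
stopped = apply_user_code([], len, run_on_empty=False)
evaluated = apply_user_code([], len, run_on_empty=True)
```

Whichever policy is chosen, the point is that UDFs and streaming would go through the same switch rather than behaving differently.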

Thoughts?  Insights?

