pig-dev mailing list archives

From Jonathan Coveney <jcove...@gmail.com>
Subject Has anyone else seen this problem? Post-group, a key ends up on multiple partitions
Date Fri, 02 Dec 2011 21:42:45 GMT
I have been looking into a pretty nasty bug, though I haven't been able
to reproduce it outside of our dataset (I need to do more work on trying to
make that happen). Prepare to enter crazytown. This bug exists on pig8 and
pig9. The bug happens about 50% of the time on both. It ALWAYS affects the
same key, though the partition the key is sent to varies.

I have a flow that looks like this:

x_and_y = foreach somedata generate source, sink;

x_and_y_grouped = group x_and_y by sink;

x_and_y_foreach = foreach x_and_y_grouped generate group as key,
COUNT(x_and_y) as ct, x_and_y.source;
store x_and_y_foreach into 'full';

x_and_y_pared_down = foreach x_and_y_foreach generate key, ct;
store x_and_y_pared_down into 'pared_down';

x_and_y_foreach_all = group x_and_y_foreach all;
x_and_y_foreach_stat = foreach x_and_y_foreach_all generate
MAX(x_and_y_foreach.key) as max_key, COUNT(x_and_y_foreach) as count,
SUM(x_and_y_foreach.ct) as sum;

store x_and_y_foreach_stat into 'sum';

Ok, here is where things get crazy: ~50% of the time, 'pared_down' will
have more rows than 'full.' Yeah. And x_and_y_foreach_stat will have the
wrong count. Looking at the output files, there is a key that in one output
part file, is (key,correct_count). And in another output part file, is

I have done many things to see what could cause this. I turned off all
optimizations, I turned off speculative execution, I turned off multiquery
optimization, I did this all in pig8 and pig9...and got the same error.
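
For anyone who wants to check their own output for the same symptom, here is
a rough sketch (Python, not part of the flow above) that scans the part files
of one output directory and reports any key that shows up in more than one of
them. It assumes the output has been copied to local disk, that the files are
named part-*, that fields are tab-delimited (the PigStorage default), and that
the key is the first field. The function name keys_in_multiple_parts is just
an illustrative name I made up.

```python
import glob
import os
from collections import defaultdict


def keys_in_multiple_parts(output_dir, delimiter="\t"):
    """Return {key: [part files]} for keys found in more than one part file.

    Assumes (hypothetically) a local copy of a Pig output directory with
    files named part-*, one record per line, key in the first field.
    """
    files_by_key = defaultdict(set)
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                key = line.split(delimiter, 1)[0]
                files_by_key[key].add(os.path.basename(path))
    # After a group, each key should live in exactly one part file.
    return {
        key: sorted(files)
        for key, files in files_by_key.items()
        if len(files) > 1
    }
```

Calling keys_in_multiple_parts("pared_down") on a healthy run should return an
empty dict; a non-empty result lists each offending key and the part files it
landed in.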

More crazy:

If we do the exact same thing but group on source instead of sink, we haven't
gotten the error yet.

Anyone have any ideas what this may be related to? Seen anything similar?
I'm going to try and reproduce on a non-proprietary data set, but given
that nobody has complained about this before, I imagine it's a really weird
corner case somewhere.

