hadoop-pig-dev mailing list archives

From Swati Jain <swat...@aggiemail.usu.edu>
Subject PIG Logical Optimization: Use CNF in SplitFilter
Date Mon, 05 Jul 2010 09:34:15 GMT

I am interested in implementing logical optimization rules, and to that end
I have studied the currently implemented logical rules and the rule
framework. In particular, I feel that the rules dealing with LOFilter cannot
handle complicated boolean expressions. I would like to share some
suggestions for improving the handling of boolean expressions in LOFilter to
enable better optimization.

1. SplitFilter Rule: The SplitFilter rule splits one LOFilter into two at a
top-level "AND". However, it cannot split an LOFilter whose top-level
operator is "OR". For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example, the current rule is not able to push any filter
condition below any join, because the expression as a whole contains columns
from all branches (inputs). But if we convert this expression into
"Conjunctive Normal Form" (CNF), then we would be able to push the conjunct
(c1 < 10) OR (c2 == 5), which references only columns of C, below both
joins. Here is the CNF expression for the highlighted line:

( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
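
The conversion above can be reproduced with a small sketch. This is not Pig
code and not a proposed design: it is an illustration in Python, assuming
expressions are nested tuples of the form (op, left, right) with leaf
predicates as plain strings, and assuming the input contains no NOT (negation
would first have to be pushed down to the leaves). The names to_cnf and
conjuncts are hypothetical.

```python
def to_cnf(expr):
    """Rewrite an AND/OR expression tree into CNF by distributing OR over AND."""
    if isinstance(expr, str):          # leaf predicate, e.g. "c1 < 10"
        return expr
    op, left, right = expr
    if op not in ("and", "or"):
        raise ValueError("unsupported operator: " + repr(op))
    left, right = to_cnf(left), to_cnf(right)
    if op == "and":
        return ("and", left, right)
    # op == "or": (x AND y) OR z  ==>  (x OR z) AND (y OR z)
    if isinstance(left, tuple) and left[0] == "and":
        return ("and",
                to_cnf(("or", left[1], right)),
                to_cnf(("or", left[2], right)))
    # x OR (y AND z)  ==>  (x OR y) AND (x OR z)
    if isinstance(right, tuple) and right[0] == "and":
        return ("and",
                to_cnf(("or", left, right[1])),
                to_cnf(("or", left, right[2])))
    return ("or", left, right)


def conjuncts(expr):
    """Flatten the top-level ANDs of a CNF expression into a list of clauses."""
    if isinstance(expr, tuple) and expr[0] == "and":
        return conjuncts(expr[1]) + conjuncts(expr[2])
    return [expr]


# The filter condition from the example script.
expr = ("or", ("and", "c1 < 10", "a3+b3 > 10"), "c2 == 5")
cnf = to_cnf(expr)
for clause in conjuncts(cnf):
    print(clause)
```

Each element returned by conjuncts is a candidate for its own LOFilter. Note
that distributing OR over AND can grow the expression exponentially in the
worst case, so a real rule would probably need a size limit.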

*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF; it would then be easy to push individual parts
(conjuncts) of the LOFilter boolean expression selectively.
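
As a rough illustration of "selectively", the sketch below decides, for each
conjunct of the CNF expression in the example, whether it touches a single
input and could therefore be pushed down to it. It is written in Python for
brevity and is not Pig's optimizer API; the SCHEMAS table and the function
names are assumptions made for the example, and the column sets per conjunct
are written out by hand rather than extracted from an expression plan.

```python
# Hypothetical schemas mirroring the three inputs of the example script.
SCHEMAS = {
    "A": {"a1", "a2", "a3"},
    "B": {"b1", "b2", "b3"},
    "C": {"c1", "c2", "c3"},
}


def inputs_referenced(columns, schemas=SCHEMAS):
    """Return the sorted aliases whose schema contains any of the given columns."""
    return sorted(alias for alias, cols in schemas.items() if cols & set(columns))


def pushable_input(columns, schemas=SCHEMAS):
    """Return the one alias a conjunct depends on, or None if it spans several."""
    refs = inputs_referenced(columns, schemas)
    return refs[0] if len(refs) == 1 else None


# Columns used by each conjunct of the CNF expression (listed by hand).
conjunct_columns = {
    "(c1 < 10) OR (c2 == 5)": {"c1", "c2"},
    "(a3+b3 > 10) OR (c2 == 5)": {"a3", "b3", "c2"},
}

for text, cols in conjunct_columns.items():
    target = pushable_input(cols)
    where = "push down to " + target if target else "keep above the joins"
    print(text, "->", where)
```

Under these assumptions the first conjunct depends only on C and is a
candidate for pushing below both joins, while the second references A, B and
C and has to stay above J2.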

I have started thinking about the design for implementing this conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback or

