Hopefully by this week. I'm still in the debugging phase of the work.
While you are welcome to reuse some of my algorithms, I doubt you can
reuse the code as much as you want. It's basically for my DNF use. You
might need to factor out some general codes which you can find
reusable.
I fully understand the I/O benefits as I put in my first message. And it
is classified as "Scenario 1". There is no doubt that it should be
considered as part of your work. However, for this, CNF is not
necessary.
For scenario 2, the benefits will be from less incore logical
expression evaluation costs and no I/O benefits as I can see. And use of
CNF may or may not lead to cheaper evaluations as the example in my
first message shows. In other words, after use of CNF, you should
compare the eval cost with that in the original expression eval before
deciding either the CNF or the original form should be evaluated.
Please let me know if I miss any of your points.
Thanks,
Yan
________________________________
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
Sent: Monday, July 12, 2010 11:52 AM
To: Yan Zhou
Cc: pigdev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter
I was wondering if you are not going to check in your patch soon then it
would be great if you could share it with me. I believe I might be able
to reuse some of your (utility) functionality directly or get some
ideas.
About your costbenefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the ORnesting.
2) One benefit of this conversion is that it will allow pushing parts of
a filter (conjuncts) across the joins which is not happening in the
current PushUpFilter optimization. Moreover, it may result in a
cascading effect to push the conjuncts below other operators by other
rules that may be fired as a result. The benefit from this is really
data dependent, but in bigdata workloads, any kind of predicate
pushdown may eventually lead to big savings in amount of data read or
amount of data transfered/shuffled across the network (I need to
understand the LogicalPlan to PhysicalPlan conversion better to give
concrete examples).
Thanks!
Swati
On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <yanz@yahooinc.com> wrote:
Yes, I already implemented the "NOT push down" upfront, so you do not
need to do that.
The support of CNF will probably be the most difficulty part. But as I
mentioned last time, you should compare the cost after the trimming CNF
to get the postsplit filtering logic. Given the complexity of
manipulating CNF and undetermined benefits, I am not sure it should be
in scope at this moment or not.
To handle CNF, I think it's a good idea to create a new plan and connect
the nodes in the new plan to the base plan as you envisioned. In my
changes, which uses DNF instead of CNF but processing is similar
otherwise, I use a LogicalExpressionProxy, which contains a "source"
member that is just the node in the original plan, to link the nodes in
the new plan and old plan. The original LogicalExpression is enhanced
with a counter to trace the # of proxies of the original nodes since
normal form creation will "spread" the nodes in the original tree across
many normalized nodes. The benefit, aside from not setting the plan, is
that the original expression is trimmed according to the processing
results from DNF; while DNF is created separately and as a kinda utility
so that complex features can be used. In my changes, I used
multiplechild tree in DNF while not changing the original binary
expression tree structure. Another benefit is that the original tree is
kept as much as it is at the start, i.e., I do not attempt to optimize
its overall structure beyond trimming based upon the simplification
logics. (I also control the size of DNF to 100 nodes.) The down side of
this is added complexity.
But in your case, for scenario 2 which is the whole point to use CNF,
you would need to change the original expression tree structurally
beyond trimming for postsplit filtering logic. The other benefit of
using multiplechild expression is depending upon if you plan to support
such expression to replace current binary tree
in the final plan. Even though I think it's a good idea to support that,
but it is not in my scope now.
I'll add my algorithm details soon to my jira. Please take a look and
comment as you see appropriate.
Thanks,
Yan
________________________________
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
Sent: Friday, July 09, 2010 11:00 PM
To: Yan Zhou
Cc: pigdev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter
Hi Yan,
I agree that the first scenario (filter logic applied to individual
input sources) doesn't need conversion to CNF and that it will be a good
idea to add CNF functionality for the second scenario. I was also
planning to provide a configurable threshold value to control the
complexity of CNF conversion.
As part of the above, I wrote a utility to push the "NOT" operator in
predicates below the "AND" and "OR" operators (Scenario 2 in PIG1399).
I am considering making this utility to push NOT a separate rule in
itself. Lmk if you have already implemented this.
While implementing this utility I am facing some trouble in maintaining
OperatorPlan consistent as I rewrite the expression. This is because
each operator is referencing the main filter logical plan. Here is my
current approach of implementation:
1. I am creating a new LogicalExpressionPlan for the converted boolean
expression.
2. I am creating new logical expressions while pushing the NOT
operation, converting AND into OR, OR into AND eliminating NOT NOT
pairs.
3. However, I am having trouble updating the LogicalExpressionPlan if it
reaches the base case ( i.e. root operator is not NOT,AND,OR).
D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
);
In the above, for example, I am not sure how to integrate base
expression (c2 == 5) into the new LogicalExpressionPlan. There is no
routine to set the plan for a given operator and its children. Also,
there is currently no way to deepCopy an expression into a new
OperatorPlan. It would be great if you could give me some suggestions on
what approach to take for this.
One approach I thought of is to visit the base expression and create and
connect the base expression to the LogicalExpressionPlan as I visit it.
Thoughts?
Swati
ps: About your other point regarding binary vs multi way trees, the way
I am creating the normal form is a list of conjuncts, where each
conjunct is a list of disjuncts. This is logically similar to a multi
waytree. However, the current modeling of boolean expressions (modeled
as binary expressions) requires a conversion back to the binary tree
model when adding back to the main plan.
On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <yanz@yahooinc.com> wrote:
Swati,
I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG1399), but not on the filter
split front. So I guess our interests will have some overlaps.
I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.
The benefits for 1) are any benefits if the underlying storage supports
predicate pushdown; plus the memory/CPU savings by PIG for not
processing the unqualified rows.
For 2), the purpose is not paying higher evaluation costs than
necessary.
For 1), no normal form is necessary. The original logical expression
tree
can be trimmed off any subexpressions that are not constants nor just
from a particular input source. The complexity is linear with the tree
size; while the use of normal form could potentially lead to exponential
complexity. The difficulty with this approach is how to generate the
filtering logic for 2); while CNF can be used to easily figure out the
logic for 2). However, the exact logic in 2) might not be cheaper to
evaluate than the original logical expression. An example is "Filter J2
by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
to 4, compared with 3 logical evaluations in the original form.
In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.
Another potential optimization to consider is to support logical
expression tree of multiple children vs. the binary tree after taking
into consideration of the commutative property of OR and AND operations.
The advantages are less tree traversal costs and easier to change the
evaluation ordering within the same subtree in order to maximize the
possibilities to shortcut the evaluations. Although this is general for
all logical expressions, this tends to be more suitable for normal form
handlings as normal forms group the subexpressions by the operators
that act on the subexpressions.
Thanks,
Yan
Original Message
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
Sent: Monday, July 05, 2010 2:34 AM
To: pigdev@hadoop.apache.org
Cc: Daniel Dai
Subject: PIG Logical Optimization: Use CNF in SplitFilter
Hi,
I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are
not
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to
enable
better optimization.
1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
two by
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:
*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;
In the above example current rule is not able to any filter condition
across
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we
would
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:
( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of
the
LOFilter boolean expression selectively.
I have started thinking about the design for implementing this
conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback
or
ideas.
Thanks!
Swati
