hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Swati Jain <swat...@aggiemail.usu.edu>
Subject Re: PIG Logical Optimization: Use CNF in SplitFilter
Date Mon, 12 Jul 2010 18:52:26 GMT
I was wondering if you are not going to check in your patch soon then it
would be great if you could share it with me. I believe I might be able to
reuse some of your (utility) functionality directly or get some ideas.

About your cost-benefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the OR-nesting.
2) One benefit of this conversion is that it will allow pushing parts of a
filter (conjuncts) across the joins which is not happening in the current
PushUpFilter optimization. Moreover, it may result in a cascading effect to
push the conjuncts below other operators by other rules that may be fired as
a result. The benefit from this is really data dependent, but in big-data
workloads, any kind of predicate pushdown may eventually lead to big savings
in amount of data read or amount of data transfered/shuffled across the
network (I need to understand the LogicalPlan to PhysicalPlan conversion
better to give concrete examples).


On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <yanz@yahoo-inc.com> wrote:

>  Yes, I already implemented the “NOT push down” upfront, so you do not
> need to do that.
> The support of CNF will probably be the most difficulty part. But as I
> mentioned last time, you should compare the cost after the trimming CNF to
> get the post-split filtering logic. Given the complexity of manipulating CNF
> and undetermined benefits, I am not sure it should be in scope at this
> moment or not.
> To handle CNF, I think it’s a good idea to create a new plan and connect
> the nodes in the new plan to the base plan as you envisioned. In my changes,
> which uses DNF instead of CNF but processing is similar otherwise, I use a
> LogicalExpressionProxy, which contains a “source” member that is just the
> node in the original plan, to link the nodes in the new plan and old plan.
>  The original LogicalExpression is enhanced with a counter to trace the # of
> proxies of the original nodes since normal form creation will “spread” the
> nodes in the original tree across many normalized nodes. The benefit, aside
> from not setting the plan, is that the original expression is trimmed
> according to the processing results from DNF; while DNF is created
> separately and as a kinda utility so that complex features can be used. In
> my changes, I used multiple-child tree in DNF while not changing the
> original binary expression tree structure. Another benefit is that the
> original tree is kept as much as it is at the start, i.e., I do not attempt
> to optimize its overall structure beyond trimming based upon the
> simplification logics. (I also control the size of DNF to 100 nodes.) The
> down side of this is added complexity.
> But in your case, for scenario 2 which is the whole point to use CNF, you
> would need to change the original expression tree structurally beyond
> trimming for post-split filtering logic. The other benefit of using
> multiple-child expression is depending upon if you plan to support such
> expression to replace current binary tree
> in the final plan. Even though I think it’s a good idea to support that,
> but it is not in my scope now.
> I’ll add my algorithm details soon to my jira. Please take a look and
> comment as you see appropriate.
> Thanks,
> Yan
>  ------------------------------
> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> *Sent:* Friday, July 09, 2010 11:00 PM
> *To:* Yan Zhou
> *Cc:* pig-dev@hadoop.apache.org
> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
> Hi Yan,
> I agree that the first scenario (filter logic applied to individual input
> sources) doesn't need conversion to CNF and that it will be a good idea to
> add CNF functionality for the second scenario. I was also planning to
> provide a configurable threshold value to control the complexity of CNF
> conversion.
> As part of the above, I wrote a utility to push the "NOT" operator in
> predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am
> considering making this utility to push NOT a separate rule in itself. Lmk
> if you have already implemented this.
> While implementing this utility I am facing some trouble in maintaining
> OperatorPlan consistent as I rewrite the expression. This is because each
> operator is referencing the main filter logical plan. Here is my current
> approach of implementation:
> 1. I am creating a new LogicalExpressionPlan for the converted boolean
> expression.
> 2. I am creating new logical expressions while pushing the NOT operation,
> converting AND into OR, OR into AND eliminating NOT NOT pairs.
> 3. However, I am having trouble updating the LogicalExpressionPlan if it
> reaches the base case ( i.e. root operator is not NOT,AND,OR).
> D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) );
> In the above, for example, I am not sure how to integrate base expression
> (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the
> plan for a given operator and its children. Also, there is currently no way
> to deepCopy an expression into a new OperatorPlan. It would be great if you
> could give me some suggestions on what approach to take for this.
> One approach I thought of is to visit the base expression and create and
> connect the base expression to the LogicalExpressionPlan as I visit it.
> Thoughts?
> Swati
> ps: About your other point regarding binary vs multi way trees, the way I
> am creating the normal form is a list of conjuncts, where each conjunct is a
> list of disjuncts. This is logically similar to a multi waytree. However,
> the current modeling of boolean expressions (modeled as binary expressions)
> requires a conversion back to the binary tree model when adding back to the
> main plan.
> On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <yanz@yahoo-inc.com> wrote:
> Swati,
> I happen to be working on the logical expression simplification effort
> (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
> split front. So I guess our interests will have some overlaps.
> I think the filter logic split problem can be divided into 2 parts:
> 1) the filtering logic that can be applied to individual input sources;
> and 2) the filtering logic that has to be applied when merged(or joined)
> inputs are processed.
> The benefits for 1) are any benefits if the underlying storage supports
> predicate pushdown; plus the memory/CPU savings by PIG for not
> processing the unqualified rows.
> For 2), the purpose is not paying higher evaluation costs than
> necessary.
> For 1), no normal form is necessary. The original logical expression
> tree
> can be trimmed off any sub-expressions that are not constants nor just
> from a particular input source. The complexity is linear with the tree
> size; while the use of normal form could potentially lead to exponential
> complexity. The difficulty with this approach is how to generate the
> filtering logic for 2); while CNF can be used to easily figure out the
> logic for 2). However, the exact logic in 2) might not be cheaper to
> evaluate than the original logical expression. An example is "Filter J2
> by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
> filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
> ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
> will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
> to 4, compared with 3 logical evaluations in the original form.
> In summary, if only 1) is desired, the tree trimming is enough. If 2) is
> desired too, then CNF could be used but its complexity should be
> controlled and the cost of the filtering logic evaluation in 2) should
> be computed and compared with the original expression evaluation cost.
> Further optimization is possible in this direction.
> Another potential optimization to consider is to support logical
> expression tree of multiple children vs. the binary tree after taking
> into consideration of the commutative property of OR and AND operations.
> The advantages are less tree traversal costs and easier to change the
> evaluation ordering within the same sub-tree in order to maximize the
> possibilities to short-cut the evaluations. Although this is general for
> all logical expressions, this tends to be more suitable for normal form
> handlings as normal forms group the sub-expressions by the operators
> that act on the sub-expressions.
> Thanks,
> Yan
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 05, 2010 2:34 AM
> To: pig-dev@hadoop.apache.org
> Cc: Daniel Dai
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
> Hi,
> I am interested in implementing logical optimization rules and to target
> this I have studied currently implemented logical rules and the rule
> framework. In particular, I felt that rules dealing with LOfilter are
> not
> able to handle complicated boolean expressions. I would like to share
> suggestions to improve handling of boolean expressions in LOFilter to
> enable
> better optimization.
> 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
> two by
> "AND". However it will not be able to split LOFilter if the top level
> operator is "OR". For example:
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
> In the above example current rule is not able to any filter condition
> across
> any join as it contains columns from all branches (inputs). But if we
> convert this expression into "Conjunctive Normal Form" (CNF) then we
> would
> be able to push filter condition c1< 10 and c2 == 5 below both join
> conditions. Here is the CNF expression for highlighted line:
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
> *Suggestions:* It would be a good idea to convert LOFilter's boolean
> expression into CNF, it would then be easy to push parts (conjuncts) of
> the
> LOFilter boolean expression selectively.
> I have started thinking about the design for implementing this
> conversion
> (arbitrary boolean expression to CNF) and would appreciate any feedback
> or
> ideas.
> Thanks!
> Swati

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message