hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Swati Jain <swat...@aggiemail.usu.edu>
Subject Re: PIG Logical Optimization: Use CNF in SplitFilter
Date Tue, 13 Jul 2010 00:40:25 GMT
Hi Yan

Thanks for your prompt reply. I did not understand your statement “C1 and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF”.

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 < 10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

Now we can push ( (c1 < 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 ==5) ) stays put
below the JOIN.

Please let me know if there is a way of doing the above optimization without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou <yanz@yahoo-inc.com> wrote:

> I see. There looks like some disconnect about "Scenario 1". To me, all
> filtering logics that can be pushed above JOIN can be figured out
> without use of CNF, which is scenario 1; while CNF helps to derive the
> filtering logic after (or, in your example, below) JOIN, which is
> Scenario 2.
>
>
>
> In your example, C1 and C2, or their equivalent, above JOIN can be
> easily figured out without resorting to CNF; C3 may have to be figured
> out with CNF, but evaluation cost of the post-Join filtering logic thus
> generated may not be cheaper than the original one before pushing up.
>
>
>
> In summary, if we want to support scenario 2(and 1), we should use CNF;
> if we JUST want to support scenario 1, which will push up all possible
> filters closer to source and have all benefits on pruned I/O, we should
> not use CNF.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 12, 2010 4:04 PM
> To: pig-dev@hadoop.apache.org
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Yan,
>
>
>
> What I meant in my last email was that scenario 2 optimizations would
> lead
>
> to more opportunities for scenario 1 kind of optimizations.
>
>
>
> Consider the conjunct list [C1;C2;C3] as the source of a JOIN.
>
>
>
> (a)  Suppose none of these are computable on a join input, in this case
> we
>
> retain the original expression and discard the CNF.
>
>
>
> (b)  Suppose C1 is computable on join input J1 and C2 is computable on
> join
>
> input J2 but C3 requires a combination of both join inputs. In this
> case, we
>
> push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
> that
>
> C1 and C2 may be further pushed up (with additional iterations of the
>
> optimizer). If they are now the source of single input operators, it is
>
> similar to scenario 1.
>
>
>
> Thanks,
>
> Swati
>
>
>
>
>
> On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <yanz@yahoo-inc.com> wrote:
>
>
>
> >  Hopefully by this week. I'm still in the debugging phase of the work.
>
> > While you are welcome to reuse some of my algorithms, I doubt you can
> reuse
>
> > the code as much as you want. It's basically for my DNF use. You might
> need
>
> > to factor out some general codes which you can find
>
> >
>
> > reusable.
>
> >
>
> >
>
> >
>
> > I fully understand the I/O benefits as I put in my first message. And
> it is
>
> > classified as "Scenario 1". There is no doubt that it should be
> considered
>
> > as part of your work. However, for this, CNF is not necessary.
>
> >
>
> >
>
> >
>
> > For scenario 2, the benefits will be from less in-core logical
> expression
>
> > evaluation costs and no I/O benefits as I can see. And use of CNF may
> or may
>
> > not lead to cheaper evaluations as the example in my first message
> shows. In
>
> > other words, after use of CNF, you should
>
> >
>
> > compare the eval cost with that in the original expression eval before
>
> > deciding either the CNF or the original form should be evaluated.
>
> >
>
> >
>
> >
>
> > Please let me know if I miss any of your points.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Monday, July 12, 2010 11:52 AM
>
> >
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > I was wondering if you are not going to check in your patch soon then
> it
>
> > would be great if you could share it with me. I believe I might be
> able to
>
> > reuse some of your (utility) functionality directly or get some ideas.
>
> >
>
> > About your cost-benefit question:
>
> > 1) I will control the complexity of CNF conversion by providing a
>
> > configurable threshold value which will limit the OR-nesting.
>
> > 2) One benefit of this conversion is that it will allow pushing parts
> of a
>
> > filter (conjuncts) across the joins which is not happening in the
> current
>
> > PushUpFilter optimization. Moreover, it may result in a cascading
> effect to
>
> > push the conjuncts below other operators by other rules that may be
> fired as
>
> > a result. The benefit from this is really data dependent, but in
> big-data
>
> > workloads, any kind of predicate pushdown may eventually lead to big
> savings
>
> > in amount of data read or amount of data transfered/shuffled across
> the
>
> > network (I need to understand the LogicalPlan to PhysicalPlan
> conversion
>
> > better to give concrete examples).
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> > On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <yanz@yahoo-inc.com> wrote:
>
> >
>
> > Yes, I already implemented the "NOT push down" upfront, so you do not
> need
>
> > to do that.
>
> >
>
> >
>
> >
>
> > The support of CNF will probably be the most difficulty part. But as I
>
> > mentioned last time, you should compare the cost after the trimming
> CNF to
>
> > get the post-split filtering logic. Given the complexity of
> manipulating CNF
>
> > and undetermined benefits, I am not sure it should be in scope at this
>
> > moment or not.
>
> >
>
> >
>
> >
>
> > To handle CNF, I think it's a good idea to create a new plan and
> connect
>
> > the nodes in the new plan to the base plan as you envisioned. In my
> changes,
>
> > which uses DNF instead of CNF but processing is similar otherwise, I
> use a
>
> > LogicalExpressionProxy, which contains a "source" member that is just
> the
>
> > node in the original plan, to link the nodes in the new plan and old
> plan.
>
> >  The original LogicalExpression is enhanced with a counter to trace
> the # of
>
> > proxies of the original nodes since normal form creation will "spread"
> the
>
> > nodes in the original tree across many normalized nodes. The benefit,
> aside
>
> > from not setting the plan, is that the original expression is trimmed
>
> > according to the processing results from DNF; while DNF is created
>
> > separately and as a kinda utility so that complex features can be
> used. In
>
> > my changes, I used multiple-child tree in DNF while not changing the
>
> > original binary expression tree structure. Another benefit is that the
>
> > original tree is kept as much as it is at the start, i.e., I do not
> attempt
>
> > to optimize its overall structure beyond trimming based upon the
>
> > simplification logics. (I also control the size of DNF to 100 nodes.)
> The
>
> > down side of this is added complexity.
>
> >
>
> >
>
> >
>
> > But in your case, for scenario 2 which is the whole point to use CNF,
> you
>
> > would need to change the original expression tree structurally beyond
>
> > trimming for post-split filtering logic. The other benefit of using
>
> > multiple-child expression is depending upon if you plan to support
> such
>
> > expression to replace current binary tree
>
> >
>
> > in the final plan. Even though I think it's a good idea to support
> that,
>
> > but it is not in my scope now.
>
> >
>
> >
>
> >
>
> > I'll add my algorithm details soon to my jira. Please take a look and
>
> > comment as you see appropriate.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >
>
> >
>
> >
>
> >
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Friday, July 09, 2010 11:00 PM
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > Hi Yan,
>
> >
>
> > I agree that the first scenario (filter logic applied to individual
> input
>
> > sources) doesn't need conversion to CNF and that it will be a good
> idea to
>
> > add CNF functionality for the second scenario. I was also planning to
>
> > provide a configurable threshold value to control the complexity of
> CNF
>
> > conversion.
>
> >
>
> > As part of the above, I wrote a utility to push the "NOT" operator in
>
> > predicates below the "AND" and "OR" operators (Scenario 2 in
> PIG-1399). I am
>
> > considering making this utility to push NOT a separate rule in itself.
> Lmk
>
> > if you have already implemented this.
>
> >
>
> > While implementing this utility I am facing some trouble in
> maintaining
>
> > OperatorPlan consistent as I rewrite the expression. This is because
> each
>
> > operator is referencing the main filter logical plan. Here is my
> current
>
> > approach of implementation:
>
> >
>
> > 1. I am creating a new LogicalExpressionPlan for the converted boolean
>
> > expression.
>
> > 2. I am creating new logical expressions while pushing the NOT
> operation,
>
> > converting AND into OR, OR into AND eliminating NOT NOT pairs.
>
> > 3. However, I am having trouble updating the LogicalExpressionPlan if
> it
>
> > reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> >
>
> > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
> );
>
> >
>
> > In the above, for example, I am not sure how to integrate base
> expression
>
> > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to
> set the
>
> > plan for a given operator and its children. Also, there is currently
> no way
>
> > to deepCopy an expression into a new OperatorPlan. It would be great
> if you
>
> > could give me some suggestions on what approach to take for this.
>
> >
>
> > One approach I thought of is to visit the base expression and create
> and
>
> > connect the base expression to the LogicalExpressionPlan as I visit
> it.
>
> >
>
> > Thoughts?
>
> > Swati
>
> >
>
> > ps: About your other point regarding binary vs multi way trees, the
> way I
>
> > am creating the normal form is a list of conjuncts, where each
> conjunct is a
>
> > list of disjuncts. This is logically similar to a multi waytree.
> However,
>
> > the current modeling of boolean expressions (modeled as binary
> expressions)
>
> > requires a conversion back to the binary tree model when adding back
> to the
>
> > main plan.
>
> >
>
> > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <yanz@yahoo-inc.com> wrote:
>
> >
>
> > Swati,
>
> >
>
> > I happen to be working on the logical expression simplification effort
>
> > (https://issues.apache.org/jira/browse/PIG-1399), but not on the
> filter
>
> > split front. So I guess our interests will have some overlaps.
>
> >
>
> > I think the filter logic split problem can be divided into 2 parts:
>
> > 1) the filtering logic that can be applied to individual input
> sources;
>
> > and 2) the filtering logic that has to be applied when merged(or
> joined)
>
> > inputs are processed.
>
> >
>
> > The benefits for 1) are any benefits if the underlying storage
> supports
>
> > predicate pushdown; plus the memory/CPU savings by PIG for not
>
> > processing the unqualified rows.
>
> >
>
> > For 2), the purpose is not paying higher evaluation costs than
>
> > necessary.
>
> >
>
> > For 1), no normal form is necessary. The original logical expression
>
> > tree
>
> > can be trimmed off any sub-expressions that are not constants nor just
>
> > from a particular input source. The complexity is linear with the tree
>
> > size; while the use of normal form could potentially lead to
> exponential
>
> > complexity. The difficulty with this approach is how to generate the
>
> > filtering logic for 2); while CNF can be used to easily figure out the
>
> > logic for 2). However, the exact logic in 2) might not be cheaper to
>
> > evaluate than the original logical expression. An example is "Filter
> J2
>
> > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2)
> the
>
> > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
>
> > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
>
> > will be 5 logical evaluations (3 OR plus 2 AND), which could be
> reduced
>
> > to 4, compared with 3 logical evaluations in the original form.
>
> >
>
> > In summary, if only 1) is desired, the tree trimming is enough. If 2)
> is
>
> > desired too, then CNF could be used but its complexity should be
>
> > controlled and the cost of the filtering logic evaluation in 2) should
>
> > be computed and compared with the original expression evaluation cost.
>
> > Further optimization is possible in this direction.
>
> >
>
> > Another potential optimization to consider is to support logical
>
> > expression tree of multiple children vs. the binary tree after taking
>
> > into consideration of the commutative property of OR and AND
> operations.
>
> > The advantages are less tree traversal costs and easier to change the
>
> > evaluation ordering within the same sub-tree in order to maximize the
>
> > possibilities to short-cut the evaluations. Although this is general
> for
>
> > all logical expressions, this tends to be more suitable for normal
> form
>
> > handlings as normal forms group the sub-expressions by the operators
>
> > that act on the sub-expressions.
>
> >
>
> > Thanks,
>
> >
>
> > Yan
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > Sent: Monday, July 05, 2010 2:34 AM
>
> > To: pig-dev@hadoop.apache.org
>
> > Cc: Daniel Dai
>
> > Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> > Hi,
>
> >
>
> > I am interested in implementing logical optimization rules and to
> target
>
> > this I have studied currently implemented logical rules and the rule
>
> > framework. In particular, I felt that rules dealing with LOfilter are
>
> > not
>
> > able to handle complicated boolean expressions. I would like to share
>
> > suggestions to improve handling of boolean expressions in LOFilter to
>
> > enable
>
> > better optimization.
>
> >
>
> > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
>
> > two by
>
> > "AND". However it will not be able to split LOFilter if the top level
>
> > operator is "OR". For example:
>
> >
>
> > *ex script:*
>
> > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
>
> > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
>
> > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
>
> > J1 = JOIN B by b1, C by c1;
>
> > J2 = JOIN J1 by $0, A by a1;
>
> > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
>
> > explain D;
>
> >
>
> > In the above example current rule is not able to any filter condition
>
> > across
>
> > any join as it contains columns from all branches (inputs). But if we
>
> > convert this expression into "Conjunctive Normal Form" (CNF) then we
>
> > would
>
> > be able to push filter condition c1< 10 and c2 == 5 below both join
>
> > conditions. Here is the CNF expression for highlighted line:
>
> >
>
> > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> >
>
> > *Suggestions:* It would be a good idea to convert LOFilter's boolean
>
> > expression into CNF, it would then be easy to push parts (conjuncts)
> of
>
> > the
>
> > LOFilter boolean expression selectively.
>
> >
>
> > I have started thinking about the design for implementing this
>
> > conversion
>
> > (arbitrary boolean expression to CNF) and would appreciate any
> feedback
>
> > or
>
> > ideas.
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> >
>
> >
>
> >
>
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message