hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule
Date Tue, 13 Jul 2010 18:41:52 GMT

    [ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887923#action_12887923

Yan Zhou commented on PIG-1399:

This work is not to optimize on the generic boolean logics, but rather to simplify the logic
expression based upon the constant values as the logical expression's operands. The former,
e.g.,  would change an boolean expression of ((A AND B) OR (A AND C)) to (A AND (B OR C));
while the latter will change, say,  (a0 > 5 and a0 > 7) to (a0 > 7).  It is, therefore,
up to the query writer to optimize his/her boolean logic, probably through use of some other

The algorithm works in a series of steps in order:

1) a constant expression evaluation visitor that will evaluate the constant expressions. It
works by traversing the expression tree in a bottom-up manner and evaluate all subexpressions
that have all constant subexpressions. All results from constant children are pushed to a
stack for the parent to digest for its own evaluation. Any non-constant expression will push
a null to the stack and consequently will cause all of its ancestors not to be evaluated.
For simplicity, only constant binary expressions and constant unary expressions and evaluated.
More complex expressions are not evaluated at this moment. For UDF, this evaluation is not
planned at all for fear of possible stateful consequences resulting from the original evaluations;

2) A NOT conversion visitor that will traverse the expression tree in a depth-first manner
with post-order handling. A status of negativity for a NOT expression is recorded in the depth-first
traversal before subtree traversal and reversed after traversing the subtree. All "reversible"
expressions is replaced by its negated counterpart for negative negativity. Currently equality
ops, and its non-equality couter part, all range comparisons, logical AND and OR are reversible.
   Notably missing is the "is null" for lack of a "is not null" base expression;

3) A DNF plan is generated, through a helper DNFPlanGenerator visitor class, whose disjunctions
are either of OrExpression or a new DNFExpression with type of "OR", whose conjunctions are
either of AndExpression or the new DNFExpression with type of "AND". 
   The introduction of the new DNFExpression, which extends LogiclExpression, is to support
multiple children (vs. the two children in a BinaryExpression) to facilitate the processing
of multiple children
   of an "OR" or "AND" operator due to the commutative property of the two operators. The
leaves of the DNF are of a new LogiclExpressionProxy type that extends the LogicalExpression.
   This new type is to be used as a proxy toward the original leaf expression in the original
filter plan. The purpose is to track how often an original expression has been put in 
   the DNF plan as result of the normalizing process. Consequently, a DNFSplitCounter member
is added to the LogicalExpression, which is incremented once a new proxy is created
   on the original expression. Due to the potentially exponential growth of the DNF plan,
and the nonlinear complexity to trim the DNF plan (see 4 below), the size of the DNF plan
is limited to 100 nodes beyind which the simplification
   beyond step 2) are just skipped;

4) Then the DNF plan is trimmed according to the inferrence rules between the operands of
the conjunctions first, and then between the operands of the disjuction in the DNF plan.
   If a leaf is trimmed, the counter, DNFSpliCounter, of the source of the proxy will be decremented.
Basically, the DNF plan is used as a utility to determine if an original leaf
   expression can be trimmed from the original filter plan or not. If all proxies of the original
leaf expression have been trimmed from the DNF plan, the original leaf expression can be trimmed
from the original plan then.
   The point is that the DNF plan is not intended to replace the original filer plan since
the DNF plan in general tends to be more expensive to evaluate than the original filter plan.

5) The original filter plan is traversed in a bottom-up manner so that if a leaf's DNFSpliCounter
is zero, which means all of its proxies on DNF has been trimmed, the leaf will be trimmed.
   For nonleafs of "AND" or "OR" expressions, if one child survives, the child will be relinked
to the predecessor(s). If either or both children are trimmed, the nonleaf will be trimmed
   too. If the whole new filter plan is empty, the filter operator will be removed from the
logical plan too.

Using a example of "B = filter A by NOT((a0 > 1) or (a1 < 3 and a0 >3+2))":

After 1), the filter plan becomes "NOT((a0 > 1) or (a1 < 3 and a0 >5))";

After 2), the filter plan becomes "(a0 <= 1) AND ((a1 >= 3) OR (a0 <= 5))";

After 3), the DNF plan is "((a0 <= 1) AND (a1 >= 3)) OR ((a0 <= 1) AND (a0 <=

After 4), the DNF plan becomes "a0 <= 1";

After 5), the filter plan becomes "a0 <= 1".

> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>                 Key: PIG-1399
>                 URL: https://issues.apache.org/jira/browse/PIG-1399
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message