pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigIllustrate" by YanZhou
Date Mon, 04 Oct 2010 23:30:36 GMT

The "PigIllustrate" page has been changed by YanZhou.


New page:

The ILLUSTRATE command is used to demonstrate how a few "good" example input records are
transformed through a series of PIG commands, as represented by the corresponding aliases,
as they are executed. The quality of the "good" data, and consequently the quality of the
illustration, is judged by three measures: completeness, conciseness and the degree of realism.
The concept and the algorithm are from http://i.stanford.edu/~olston/publications/sigmod09.pdf.

After the base data from a portion of a real input are fetched and all their derived data
along the command tree are produced, three basic steps optimize the three measures.
First, a pruning is performed to maximize conciseness without sacrificing completeness. Next,
an upstream synthesis step is executed, at "best effort", to enhance completeness while
maintaining as much realism as possible. Finally, one more pruning is performed to trim away
any redundant records that hurt conciseness after the possible introduction of synthetic
records in the second step.

The command is executed in local mode.
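The pruning-synthesis-pruning procedure above can be sketched as follows. This is an illustrative rendering, not Pig's actual code: the equivalence-class model is reduced to a `class_of` labeling function, and `make_example` stands in for upstream synthesis.

```python
def prune(records, class_of):
    """Step 1 and 3: keep one representative per equivalence class (conciseness)."""
    seen, kept = set(), []
    for r in records:
        c = class_of(r)
        if c not in seen:
            seen.add(c)
            kept.append(r)
    return kept

def synthesize(records, class_of, all_classes, make_example):
    """Step 2: best effort, add a synthetic record for every uncovered class
    (completeness), at some cost to realism."""
    covered = {class_of(r) for r in records}
    return records + [make_example(c) for c in all_classes if c not in covered]

def illustrate(records, class_of, all_classes, make_example):
    step1 = prune(records, class_of)                                # conciseness
    step2 = synthesize(step1, class_of, all_classes, make_example)  # completeness
    return prune(step2, class_of)                                   # trim redundancy
```

For example, with a FILTER on `x > 5`, the classes are "passes" and "fails"; real records cover both classes when possible, and a synthetic record is only manufactured for a class no real record reaches.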

== Status Quo of the Code on Trunk ==

The current code on trunk should support LOAD/STORE/COGROUP/JOIN/FILTER/UNION, but is still
deemed incomplete, obsolete, or of questionable quality in the following areas:

   * it uses the older logical plan scheme;
   * the optimization is applied in a very limited way and no new optimization rules are used
(of course!);
   * it does not support the LIMIT/SPLIT/DISTINCT/CROSS/STREAM operators;
   * the MAP type is not supported;
   * inner plans of !ForEach and Filter are not accounted for when figuring out the completeness/conciseness
of the example data, making the completeness calculation only marginally valid;
   * unit test cases are few and far between: the coverage is small, estimated at less than
5%, and full functional tests are lacking; the verification is cursory; no end-to-end tests
are present;
   * during pruning, when equivalence classes and affinity groups are used on a per-operator
basis, the handling is separated from the data generation process, making it difficult to
manage as they work on the generated data;
   * dead code and unused variables/arguments/classes are scattered around.

== Suggested additional development work ==

Use of the new logical plan, code cleansing, test case enhancements and bug fixes notwithstanding,
the major work will be on the functional side. Although the basic algorithm is to be adhered
to, in some areas it will be enhanced slightly so as to follow the operators' semantics while
still honoring the three principles of completeness, conciseness and realism as much as possible.

=== Feasibility of Equivalence Classes ===
One cornerstone of the algorithm is the concept of "equivalence classes", which classifies
the records an operator processes into "classes" such that any single record therein is sufficient
to represent the whole class in one aspect of the operator. This concept places two
implicit requirements on the operators: 1) the classification is value-based; 2) the classification
of a record is independent of all other records' classifications.

However, not all PIG operators meet these two requirements. Some, such as FILTER, LOAD and
UNION, fit nicely. Some others, like COGROUP and DISTINCT, need a bit more information, the "affinity
level" in the case of COGROUP, to add more constraints during pruning. Others, like LIMIT,
do not fit well since they are not value-based and are sensitive to past records' classifications.
For those operators, after the normal pruning-synthesis-pruning steps, a post process will
be introduced to restore the criteria according to the operators' semantics.
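The contrast between the two requirements can be made concrete with a hypothetical sketch: FILTER's classification is purely value-based and per-record, whereas LIMIT's depends on a record's position relative to all earlier records, violating requirement 2.

```python
def filter_class(record, predicate):
    """FILTER: class computed from the record's value alone (requirement 1),
    with no reference to any other record (requirement 2)."""
    return "passes" if predicate(record) else "fails"

def limit_classes(records, n):
    """LIMIT n: a record's class depends on its position, i.e. on how many
    earlier records were classified as retained -- hence the post process."""
    return ["retained" if i < n else "limited out"
            for i, _ in enumerate(records)]
```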

=== Equivalence Classes on Expressions and Functions ===

One more hurdle regarding the equivalence classes is that they are currently defined only on
relational operators; expression operators are not in the picture. But the functions and expression
operators that produce finite results should have equivalence classes too to make the completeness
coverage check, well, complete. Consequently, no equivalence classes are defined for so-called
"multifaceted" filtering beyond the simple "projection comparison with constant" filtering.
Specifically, the comparison operators, the Boolean operators, the SIGN and NULL operators
and the !IsEmpty and !bincond functions (namely, all the operators and functions that generate
boolean outputs, intermediate or final, for now; in the future, if other operators/functions
that generate finite results are introduced, they will have defined equivalence classes
too) should have defined equivalence classes.

In contrast, operators that generate continuous results, e.g. the arithmetic operators,
won't have defined equivalence classes. One might think, for example, that the division operator,
'/', could well define two equivalence classes, with the second holding the inputs that have
a zero denominator. However, that should be treated as an exception handling case, and the ILLUSTRATE
command should not take exception handling into consideration.
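The dividing line above might be rendered as a registry, sketched here with illustrative names: boolean-valued operators and functions get one equivalence class per truth value, while continuous-valued operators get none.

```python
# Illustrative registry only -- not Pig's actual data structures.
# Boolean-producing operators/functions: one class per output truth value.
# Continuous-producing operators: no equivalence classes (None).
EQUIVALENCE_CLASSES = {
    "==":      {True, False},
    "<":       {True, False},
    "and":     {True, False},
    "is_null": {True, False},
    "IsEmpty": {True, False},
    "bincond": {True, False},
    "+": None,   # continuous output: no equivalence classes defined
    "/": None,   # zero denominator is exception handling, not a class
}

def classes_for(op):
    """Return the equivalence classes for an operator, or None if it has none."""
    return EQUIVALENCE_CLASSES.get(op)
```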

=== Data Synthesis and Operator/Expression Invertibility ===

During the data synthesis process, operators should be able to generate input values
from given output values. Currently only the comparison operators are capable of generating
such "reverse" results. This will be extended to all expression operators.
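The "reverse" generation for comparison operators might be sketched as follows; the function and its policy of picking a boundary-adjacent value are illustrative, not Pig's actual implementation.

```python
def reverse_compare(op, constant, desired):
    """Given a comparison against a constant and a desired truth value,
    synthesize an input x such that (x op constant) == desired."""
    if op == ">":
        return constant + 1 if desired else constant
    if op == "<":
        return constant - 1 if desired else constant
    if op == "==":
        return constant if desired else constant + 1
    raise NotImplementedError("no reverse rule for operator %r" % op)
```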

However, to support "multifaceted" filtering, the equivalence classes need to be defined on all combinations
of possible Boolean truth values of the primitive logical expressions. If all the primitive
logical expressions are unrelated, it is fine to simply synthesize the input constraint fields
separately. Otherwise, the multidimensional space of the input constraint fields could be carved
in arbitrarily complex ways, making the definition of equivalence classes impracticable
if not infeasible. For instance, a simple compound logical expression of "(a > 5) AND (b
< 3)" carves the 2-d space of (a, b) into 4 sections, and one pair of synthetic values independently
chosen along both axes in each section will well represent an equivalence class. A slightly
more complex compound logical expression, "((a+b > 5) AND (a-b < 3))", will still carve
the 2-d space into 4 sections, but it is impossible to independently select the values of
a and b to represent each of the 4 sections. Given this complexity and the "best-effort"
nature already in the basic algorithm, we will synthesize data for "multifaceted" filtering
only if all the primitive logical expressions of a compound logical expression are independent.
All other types of compound logical expressions will use the existing "best-effort" approach
for the top-level logical expression. But the equivalence classes will still be defined on
all possible combinations of the truth values of the primitive logical expressions so as to
avoid unnecessary completeness reduction during pruning.
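The independent case can be sketched concretely for the "(a > 5) AND (b < 3)" example above: since each primitive constrains a different field, one (a, b) pair per truth-value combination can be synthesized by choosing the fields separately. The helper names are illustrative.

```python
from itertools import product

def synth_a(truth):
    """Make (a > 5) evaluate to the requested truth value."""
    return 6 if truth else 5

def synth_b(truth):
    """Make (b < 3) evaluate to the requested truth value."""
    return 2 if truth else 3

def synthesize_combinations():
    """One synthetic (a, b) pair per combination of the two primitives'
    truth values -- the four sections of the 2-d space."""
    return {(ta, tb): (synth_a(ta), synth_b(tb))
            for ta, tb in product([True, False], repeat=2)}
```

For "((a+b > 5) AND (a-b < 3))" no such per-field helpers exist, which is exactly why only the independent case is synthesized.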

UDFs are currently considered "irreversible". In the future, we may want to support hooks
so that UDFs can provide a way to generate the "reverse" results. That would be beneficial
for UDF testing and verification purposes.

Given the general need to synthesize data in operators and expressions, and the polymorphic
nature of operators and expressions, a new method, !getSyntheticInput, will be introduced
in the !EvalFunc and !PhysicalOperator classes. It defaults to a no-op but otherwise will work
out the input constraints based upon existing data and output constraints. This method will
conceivably take the place of many existing methods of the !AugmentBaseDataVisitor class
that handle the data synthesis process.

However, in this version, there is no plan to support data synthesis on any expressions or functions.

== Operator-Specific Features ==

=== CROSS ===

Each output record generated by inputs that have more than one record is put into a single
equivalence class. The purpose is to illustrate at least a 2-by-2 cross product between two
inputs.

=== LIMIT ===

In base data generation, all rows will be present for the pruning-synthesis-pruning procedure;
namely, LIMIT is treated as a no-op. In the post process, one synthetic record is added as
the input constraint to demonstrate the effect of the LIMIT operator. Note that by doing this,
the original LIMIT value is most likely not respected. As an explanation, a message will be
displayed to the effect that the value was changed for illustration purposes.
The use of the post process step takes the following points into consideration:

   * the semantics of LIMIT do not guarantee that any particular records are retained;
   * an implementation artifact is that the first records are retained, which tends to
hurt completeness tremendously. Consequently, the usual pruning-synthesis-pruning steps on
the data set after LIMIT would tend to create many synthetic records while leaving many
real yet unused "limited out" records.
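The LIMIT post process described above might look like this sketch, with illustrative names: one synthetic record is appended beyond the retained set, and the displayed limit (most likely different from the original) is reported in a message.

```python
def limit_post_process(retained, make_synthetic, original_limit):
    """After pruning/synthesis treated LIMIT as a no-op, add one synthetic
    record beyond the retained set so the operator's effect is visible."""
    inputs = retained + [make_synthetic()]
    displayed_limit = len(retained)   # likely differs from original_limit
    message = ("LIMIT value changed from %d to %d for illustration purposes"
               % (original_limit, displayed_limit))
    return inputs, displayed_limit, message
```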

=== DISTINCT ===

It has two equivalence classes, one with unique records, the other the opposite. A data structure
of map from the equivalence class of duplicate records to a pair of the equivalence class
of unique records and a set of !IdentitySet of duplicate records is maintained to facilitate
the transfer of a record from the equivalence class of duplicate records to that of unique
records as result of prunning and, eventually, disallow the prunning of last record(s) in
either equivalence class;
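The two classes can be sketched as below; this is a simplified illustration in which the duplicate-side bookkeeping is a plain map from value to its copies, standing in for the !IdentitySet-based structure described above.

```python
from collections import Counter, defaultdict

def distinct_classes(records):
    """Split records into the 'unique' equivalence class and a map of
    duplicate values to their copies. Pruning copies of a value down to one
    would move that value into the unique class."""
    counts = Counter(records)
    unique = [r for r in records if counts[r] == 1]
    duplicates = defaultdict(list)   # value -> all its duplicate copies
    for r in records:
        if counts[r] > 1:
            duplicates[r].append(r)
    return unique, dict(duplicates)
```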


=== FOREACH ===

Lineage tracers are set up in the nested block of FOREACH to allow for coverage of the
inner plan.

Results for nested aliases will be displayed the same way as normal operators but with the
operator name scoped with the enclosing alias using the '.' as the scoping separator.

MAP dereferences will have a different equivalence class for each distinct dereference. This
accounts for the fact that a dereference on a map key is more likely to be null than a bag
or tuple dereference.

=== SPLIT ===

Implicit SPLIT will have a single equivalence class so that at least one record will be pushed
for downstream processing.

For explicit SPLIT, for the same complexity and feasibility reasons as "multifaceted" filtering,
an approach similar to that for composite expressions in FILTER is adopted during the pruning and
synthesis processes. That is, during pruning, equivalence classes are created based upon combinations
of possible logical truth values of the children FILTERs' logical expressions; in the upstream
synthesis process, however, only the truth values of the top level of the children FILTERs'
logical expressions are considered for completeness.

In the above discussion, the children of a SPLIT are those that are traversed by the operations
leading to the alias being illustrated, of course.

For upstream data synthesis, SPLIT will union the input constraints from all its children's
output constraints. This could hurt conciseness since the input constraints from different children
might be mergeable, particularly if their significant fields, as opposed to the "don't care"
fields, do not overlap. But that would require delaying the instantiation of the "don't care"
fields by the downstream operators and setting the criteria for instantiating the "don't care"
fields by the downstream operators. The complexity is judged not to be worth the effort;
it might be left for a future release.
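The unioning behavior chosen here can be sketched in a couple of lines; the representation of a constraint as a per-field dict is an assumption for illustration only. No merging of compatible constraints is attempted, which is exactly the conciseness trade-off described above.

```python
def split_upstream_constraints(children_constraints):
    """Union the traversed children's output constraints into SPLIT's input
    constraints, with no attempt to merge compatible ones."""
    merged = []
    for constraints in children_constraints:
        merged.extend(constraints)
    return merged
```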

== Interface Illustratable ==

A new interface, Illustratable, will be introduced to support all ILLUSTRATE-related operations.
Each physical relational operator will implement it as a fully functional operation, while
each physical expression operator's implementation will throw an exception.
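A hypothetical rendering of that split, sketched in Python rather than the Java of Pig's codebase, with illustrative class and method names:

```python
class Illustratable:
    """Interface for ILLUSTRATE-related operations."""
    def illustrate(self, example_data):
        raise NotImplementedError

class RelationalOperator(Illustratable):
    def illustrate(self, example_data):
        # Placeholder for a fully functional per-operator implementation.
        return example_data

class ExpressionOperator(Illustratable):
    def illustrate(self, example_data):
        # Expression operators deliberately reject ILLUSTRATE operations.
        raise RuntimeError("expression operators do not support illustrate")
```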

== Exceptions ==

The SAMPLE operator will be made an exception in that it will not work under ILLUSTRATE. The
reason is that for SAMPLE, realism is of utmost importance, which basically forbids any
data synthesis. On the other hand, pruning would defeat the purpose of sample sizing.

Map-side join/cogroup is not supported in ILLUSTRATE, as it is performance-related, not data-centric.

STREAM and UDF do not support data synthesis now. In the future, after hooks are introduced
to allow the synthesis function in the external context, both can be enabled.

== Schema Display ==

The schema for illustrated aliases will be displayed alongside the results.

== ILLUSTRATE Script ==

Currently ILLUSTRATE only works on one alias at a time. This is aimed more at interactive
use. Support for ILLUSTRATE on a script is to be added. For non-aliased leaf nodes like
STORE, the whole statement will be used as an identifier for the results displayed.

Command flags similar to those for EXPLAIN might need to be supported in this release.

== Local Mode ==

The existing implementation uses a limited edition of "true local mode" that bypasses most call
paths from the physical plan to the map-reduce plan, and a different call path has to be used just
for ILLUSTRATE purposes.

The new version will use the "real" call paths from the physical plan to the map-reduce plan, as in
all other PIG executions. But just before the launcher launches the M/R jobs, all physical
plans contained in the map-reduce plans are transformed into ILLUSTRATE-specific ones, and
a "local runner", in place of the M/R launcher, is fired. The "local runner" will run the
pipeline locally through invocations of the "map" and "reduce" methods of the mapper and reducer
objects respectively, without launching any M/R jobs. It will also take inputs from and direct
outputs to in-memory data structures as set up by the ILLUSTRATE-specific physical plans.
== Future Enhancements ==

In the future, the following enhancements might be desired.

=== UDF Support ===

Hooks are provided so that UDFs can tell PIG what equivalence classes they will have, what
post process after the pruning-synthesis-pruning procedure is to be performed, and how the
"reverse" results can be generated to enhance the completeness.

=== Data Synthesis in "Multifaceted" Filtering beyond Independent Primitive Logical Expressions ===

Although it might be too hard to conceive a fully functional approach, support for more advanced
compound logical expressions than those built from independent primitive logical expressions
might be desired in the future.

=== Data Synthesis on Expressions/Functions ===

Data synthesis may be desired on expressions and functions.

=== STREAM ===

Refer to the Operator-specific feature section for STREAM.

=== SPLIT ===

In upstream synthesis, input constraints from children might be consolidated instead of being
unioned in order to boost conciseness. This would require delaying the instantiation of the "don't
care" fields by the downstream operators during their synthesis process, and setting up the
instantiation criteria, a "richer constraint language", by the downstream operators for the
instantiation step performed later by some upstream operator.
