pig-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigTypesDesign" by AlanGates
Date Mon, 07 Jan 2008 22:30:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AlanGates:

New page:
= Pig Types Design =

== Purpose ==
This document presents the design for adding types to pig.  See the
[:PigTypesFunctionalSpec:functional specification] for complete details of
what changes will be made.

== Architectural Changes ==
Several significant architectural changes will be made as part of adding types to
   1. The classes representing data types will be significantly changed.
   2. The way expression evaluation is done will be significantly changed.
   4. The schema class will be modified.

=== Data Type Classes ===
Currently there are four classes extending Datum:  !DataBag, !DataMap, !DataAtom,
and Tuple.  

In the new design, the class Datum will be removed.  Tuple and !DataBag will remain, but be
converted to interfaces.  Maps and all atomic types will be
implemented by already existing java objects, according to the following table:
|| '''Pig data type''' || '''Implementing Class'''      ||
|| Bag                 || org.apache.pig.data.DataBag   ||
|| Tuple               || org.apache.pig.data.Tuple     ||
|| Map                 || java.util.Map<Object, Object> ||
|| Integer             || java.lang.Integer             ||
|| Long                || java.lang.Long                ||
|| Float               || java.lang.Float               ||
|| Double              || java.lang.Double              ||
|| Chararray           || java.lang.String              ||
|| Bytearray           || byte[]                        ||

Tuple and DataBag will be converted to interfaces to allow alternate implementations by others
who wish to use their own form of tuples or bags.  Default
implementations (!DefaultTuple and !DefaultAbstractBag, !DefaultDataBag, !SortedDataBag, and
!DistinctDataBag) will be provided.  As part of this
!BagFactory will be rewritten to be abstract.  A !DefaultBagFactory will be provided that
will implement the functionality currently in !BagFactory. An
abstract factory will be added for tuples, !TupleFactory.  A !DefaultTupleFactory will be
provided.  All Tuple construction will be changed to be done
via the factory.  Each of the factories will have a getFactory() method that will determine
which actual factory to instantiate.  This determination will
be done by checking configuration parameters.  This can be used by those who wish to use their
own tuples or bags to override the default tuple or bag

The class of static values !DataType will be added to provide values for each data type. 
static const final byte values are used rather than an enum to
avoid the object overhead.

The class of static functions !DataReadWriteHelper will be added to provide for (de)serialization
of standard types.  This will allow users who implement
their own version of Tuple or Bag to still use standard methods to (de)serialize the contents.
 Methods are not provided in this class for Tuple and
!DataBag because those implement !Writable and thus will already have read and write methods.

Java objects (rather than pig specific objects) were chosen to represent most types in order
to take advantage of all the language features of each type and avoid the need to either
reimplement common functionality (such as increment for integer) or force conversion to and
from pig specific data types and java types.  First class
language objects will also be easier and more natural for UDF writers to work with.  

Objects were chosen over base types (e.g. Integer over int) in order to allow the concept
of null, and to ease the interface and implementation (ie a
tuple can be implemented as an !ArrayList of objects).

Diagram 1 presents the new types architecture in more detail:

Diagram 1:  Data Type Classes

=== Expression Evaluation ===
Currently, expression evaluation is handled by two sets
of classes, those extending Cond and those extending !EvalFunc.  Also, some
operations needed in expression evaluation (binary condition, constants,
column projection, map dereferencing) are extensions of !EvalSpec.  This will be reworked
to create one set of classes that handle expression evalution.

Keeping with the changes proposed for splitting logical and physical plans, logical expressions
and physical expressions will be created.  The logical
expressions will be used for type checking and determining the correct physical expressions
to put in the execution plan.  The physical expressions will
be used for evaluation.

New classes !LogicalExpr and !PhysicalExpr will be created.  Each type of expression will
have one logical class that extends !LogicalExpr and one or
more physical classes that extend !PhysicalExpr.  Physical expressions may have more than
one class implementing a given expression in order to handle
type overloading of that expression.  Where possible, this will be handled by generics, but
where not possible it will be handled by multiple classes
(since java generics do not allow the developer to provide a specific realization for a given
type).  For example, consider the physical expression plus.
This can be done as a generic, because the java '+' operator that it will implement can be
applied to Integer, Long, Float, or Double.  But the physical
expression equals cannot be done as a generic.  This is because equals() cannot be applied
to a java bytearray.  So a separate class will be necessary to
implement equals for bytearray types.  Separate classes will be created rather than putting
functionality for all types in one class to avoid if/else
chains in the expression evaluation.

These new classes will cause the following changes:
   * The class Cond and its extensions will be removed.
   * The !EvalFunc and !FuncEvalSpec classes will be replaced by !LogicalFuncExpr and !PhysicalFuncExpr.
   * !MapLookupSpec will be replaced by !LogicalMapDerefExpr and !PhysicalMapDerefExpr.
   * !ConstSpec will be replaced by !LogicalConstExpr and !PhysicalConstExpr.
   * !ProjectSpec will be replaced by !LogicalFieldExpr and !PhysicalFieldExpr.
   * !BinCondSpec will be replaced by !LogicalBinCondExpr !PhysicalBinCondExpr.
   * A new extension of !EvalSpec, !ExprSpec will be added.  This will contain a single expression.

TODO:  Should !StarSpec be replaced too, or should it stay an !EvalSpec?

Diagram 2: Extenders of !EvalSpec in new architecture

Diagram 3: !LogicalExpr and its extenders

Diagram 4: !PhysicalExpr and its extenders

=== Schema ===
The concept of Schema will be changed.  Currently, a Schema represents either the alias for
an atomic field (!AtomSchema) or the (possibly) set of
aliases for a complex field (!TupleSchema).  This will be broken out into two classes, !FieldSchema
and Schema.  !FieldSchema will always represent one
field, regardless of the data type of the field.  It will contain information on alias name,
data type, and (if the field is a tuple) the Schema.  Schema
will be only for tuples, and will allow the caller to fetch the !FieldSchema for a field based
either on position or alias.

Diagram 5: Schema class in the new architecture

== Detailed Design ==

=== Parser Changes ===
The parser will be changed in several areas:
   * Changes in AS syntax to support data type declaration.
   * Operator selection and type checking will be added.
   * Non-string constants will be added.
   * Changes will be made to support casts.
   * Changes will be made to DEFINE to support types for functions.

==== AS Syntax ====
AS statements will be modified to accept type specification as part of the
schema specification.  This will require changes in the grammar as well as the
construction of the Schema object.

==== Type Checking ====
Type checking will be done by a !LogicalExprVisitor on the expressions contained in each !ExprSpec.
 This will be done after the parse phase has
completed.  It could be done during parsing.  However, keeping this separate will make the
code in the parser simpler.  It will also allow multiple
implementations of the parser (as we currently have for the pig pen code) without requiring
multiple implementations of the type checker.

As part of type checking, type promotion will be done.  For example, if the user requests
the addition of an integer and a double, the type checker will
handle promoting the integer to a double by inserting a cast operator between the integer
and the plus expression.  So what started out as:
`PLUS(3, 3.0)` will become `PLUS(castIntToDouble(3), 3.0)`.  This is necessary even for cases
where java would do the promotion for us (as in our example of
double plus integer) because we do no want to assume the implementation of the backend.  We
may change that at some point in the future.

==== Physical Expression Selection ====
As noted above, for some expressions there will be multiple implementations of the expression,
based on the type.  This will be accomplished in some
cases by generics, and in some cases by separate class implementations.  The code that compiles
the logical plan into a physical plan will need to be
able to choose the correct physical implementation of a logical expression.  This will be
achieved by having tables defined for each logical operator.
These tables will be a matrix with rows and columns for each type.  The compiler code can
then look up the appropriate operator in this matrix simply
using the byte codes for the types.  For example, the table for plus will look like:

plusOperator[][] = {
   // bag, tuple, map,   int,               long,           float,           double,     
     chararray, bytearray
   {null,  null,  null,  null,              null,           null,            null,       
     null,      null}, // bag
   {null,  null,  null,  null,              null,           null,            null,       
     null,      null}, // tuple
   {null,  null,  null,  null,              null,           null,            null,       
     null,      null}, // map
   {null,  null,  null,  ExprPlus<Integer>, null,           null,            null, 
           null,      null}, // int
   {null,  null,  null,  null,              ExprPlus<Long>, null,            null, 
           null,      null}, // long
   {null,  null,  null,  null,              null,           ExprPlus<Float>, null, 
           null,      null}, // float
   {null,  null,  null,  null,              null,           null,            ExprPlus<Double>,
null,      null}, // double
   {null,  null,  null,  null,              null,           null,            null,       
     null,      null}, // chararray
   {null,  null,  null,  null,              null,           null,            null,       
     null,      null}  // bytearray

When the compiler finds a plus operator with two integers as arguments, it will look up
plusOperatorSelection[!DataType::!TypeInteger][!DataType::!TypeInteger] and then instantiate
the indicated class.

Recall that the type checker will be placing cast operators in to handle type promotion. 
This is why plus is not defined for double plus integer, etc.

If the compiler looks up an entry in the table and the result is null, it will give
an error informing the user that an operation of the requested type is not

==== Constants ====
The parser will change to support non-string constants.  For a
specification of constant syntax see the [:PigTypesFunctionSpec:functional specification].
The parser will support these constants, and encode them as !ConstExpr objects.

==== Casts ====
The parser will change to support cast syntax.  For the syntax see the [:PigTypesFunctionSpec:functional
Cast operators will be selected via tables in the same way as other operators.

==== DEFINE Syntax ====
To support changes to define syntax, a symbol table will be added to the
parser.  This table will track functions that have been defined with type
signatures.  Any function used in the query will be checked against this
table.  If the function used is not in the symbol table than the type checker
will assign a bytearray type to the function result.  If the function is
present in the symbol table it will be checked that the argument usage matches
what was specified in the symbol table and that the arguments are of the
type specified in the symbol table.  If either of these criteria are not met
the parser will throw an error.

=== Changes to the Physical Plan and Expression Evaluation ===
As described in [#Expression_Evaluation] above, a new class !ExprSpec will be added to handle
all expression evaluation.
In its constructor it
will visit its associated expression tree to discover every expression in the tree that
needs to get data from its input tuple.  It will construct this into a map of
offsets and expressions.

On each call to !DataCollector::add it will then use this map to populate the
appropriate expressions with the values from the current tuple.  It will then
call eval() on the expression at the top of the tree, and return the resulting
value as its output.

=== Changing Current Operators ===
Currently the regular expressions in the Matches operator are compiled for every
row.  This will be changed so that they are compiled once in the constructor
of the operator and just matched on each row.  In addition, regular expressions
will need to be supported on byte[] data where no coding is specified.  

TODO:  Need to investigate third party regular expression libraries to see what
we can find that will handle regular expressions on byte sequences.  Failing this, we
may be able to take a third party library and modify it to work with byte arrays.
We really don't want to write our own regular expressions. 

=== Null Handling ===
Changes will be made to each expression operator to handle null values.

=== Changes to Arithmetic Expressions as Function Arguments ===
Expression tree construction in the parser will be changed to match the new
semantics specified

View raw message