hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-143) Proposal for refactoring of parsing logic in Pig
Date Wed, 16 Apr 2008 23:45:26 GMT

    [ https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589776#action_12589776 ]

Alan Gates commented on PIG-143:

Comments on the validator patch:

1) Making Validator have a "has a" relationship to Visitor instead of an "is a" relationship
seems to have some downsides.  First, I suspect all validators are going to be the same
boilerplate code: instantiate the visitor, visit, translate any exceptions into messages for
the message collector.  Second, the visitors that are doing the actual validation are forced
to throw exceptions rather than write directly to the message collector.  This is inefficient.
If instead Validator extended Visitor, you would avoid both of these issues.  I think this
"is a" relationship is appropriate, as there will never be a validator that isn't a visitor.
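
A minimal sketch of the "is a" arrangement, assuming hypothetical stand-in classes
(MessageCollector, Operator, PlanVisitor and NonEmptyNameValidator are invented for
illustration and are not the classes in the patch).  The validator walks the plan itself and
writes to the collector directly, so no exception translation is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these are the real Pig classes.
class MessageCollector {
    private final List<String> messages = new ArrayList<>();
    void collect(String message) { messages.add(message); }
    List<String> messages() { return messages; }
}

class Operator {
    final String name;
    final List<Operator> inputs = new ArrayList<>();
    Operator(String name) { this.name = name; }
}

// Base visitor: walks the plan depth-first and calls visit() on each operator.
abstract class PlanVisitor {
    void walk(Operator root) {
        for (Operator input : root.inputs) {
            walk(input);
        }
        visit(root);
    }
    abstract void visit(Operator op);
}

// "Is a" design: the validator is itself a visitor and holds the collector.
abstract class Validator extends PlanVisitor {
    protected final MessageCollector collector;
    Validator(MessageCollector collector) { this.collector = collector; }
}

// Example rule (assumed for illustration): report operators with an empty
// name by writing to the collector instead of throwing.
class NonEmptyNameValidator extends Validator {
    NonEmptyNameValidator(MessageCollector collector) { super(collector); }
    @Override
    void visit(Operator op) {
        if (op.name.isEmpty()) {
            collector.collect("operator with empty name");
        }
    }
}
```

With this shape, each new validator only implements visit(); the instantiate/visit/translate
boilerplate goes away.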

2)  In TypeCheckingVisitor, why is the function "visit(ExpressionOperator)", which is just
a big switch statement to find the right type, necessary?  Won't Java automatically pick the
right visit() based on the class type that is passed?
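
One thing worth checking here: Java chooses among overloaded methods at compile time, from
the declared type of the argument, not from its runtime class.  The standalone check below
uses hypothetical classes (not the Pig operator hierarchy) to show that effect, which may be
why the patch carries an explicit dispatch method:

```java
// Hypothetical classes for illustration; not the Pig operator hierarchy.
class ExpressionOperator {}
class AddOperator extends ExpressionOperator {}

class OverloadDemo {
    String visit(ExpressionOperator op) { return "generic"; }
    String visit(AddOperator op) { return "add"; }

    static String dispatchOnStaticType() {
        ExpressionOperator op = new AddOperator(); // runtime type is AddOperator
        return new OverloadDemo().visit(op);       // overload is chosen from the declared type
    }
}
```

Here dispatchOnStaticType() returns "generic".  A common alternative to a switch is double
dispatch, where each operator class overrides a method that calls back visitor.visit(this).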

3) For arithmetic expressions, I think we also want to allow the expression when both sides
are bytearray.  In that case we cast both to double (since it's the most generic).  In general,
see the charts at http://wiki.apache.org/pig/PigTypesFunctionalSpec to determine what
conversions we want to do.
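
As a rough sketch of that rule: the type tags and the method below are hypothetical, and only
the bytearray/bytearray case comes from this comment.  The numeric widening order is an
assumption, and every other combination is deliberately left to the charts on the wiki:

```java
// Hypothetical type tags; the real Pig type constants may differ.
enum PigType { INTEGER, LONG, FLOAT, DOUBLE, BYTEARRAY, CHARARRAY }

class ArithmeticTypes {
    // Returns the type both operands should be cast to, or null when this
    // sketch does not decide the combination.
    static PigType resultType(PigType left, PigType right) {
        if (left == PigType.BYTEARRAY && right == PigType.BYTEARRAY) {
            return PigType.DOUBLE; // both sides untyped: cast both to double
        }
        if (isNumeric(left) && isNumeric(right)) {
            // Assumption: widen to the larger numeric type; the enum lists
            // numeric types from narrowest to widest.
            return left.ordinal() >= right.ordinal() ? left : right;
        }
        return null; // other combinations: consult the spec's charts
    }

    static boolean isNumeric(PigType t) {
        return t == PigType.INTEGER || t == PigType.LONG
            || t == PigType.FLOAT || t == PigType.DOUBLE;
    }
}
```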

And answers to previous questions:

[pi] From PigTypeFunctionalSpec wiki "bag(type | unknown)" ==> So bags support types
other than tuple?

[alan] Originally that was the plan, and I think we'd still like to get there
some day.  We haven't implemented that yet.


[pi] If I have tuple schemas [int, int, int] and [long, long, long], are they compatible?
This seems like the nested type parameter problem in Java (e.g. List<List<Dog>>
and List<ArrayList<Dog>>). I will have to think it through if we still don't know
the solution.

[alan] On my to-do list is adding a method to Schema to do schema merging.
This would handle exactly issues like this, doing type promotion where it is
needed and throwing errors in cases where types can't be promoted.
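
A rough sketch of what such a merge could look like.  FieldType and SchemaMerger are
hypothetical names, schemas are simplified to flat lists of field types, and the promotion
order (int, long, float, double) is an assumption rather than the method that will go into
Schema:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical field types; numeric types are listed narrowest to widest.
enum FieldType { INT, LONG, FLOAT, DOUBLE, CHARARRAY }

class SchemaMerger {
    static boolean isNumeric(FieldType t) { return t != FieldType.CHARARRAY; }

    // Promote two field types to a common type, or fail if none exists.
    static FieldType mergeField(FieldType a, FieldType b) {
        if (a == b) {
            return a;
        }
        if (isNumeric(a) && isNumeric(b)) {
            return a.ordinal() >= b.ordinal() ? a : b; // pick the wider type
        }
        throw new IllegalArgumentException("cannot promote " + a + " and " + b);
    }

    // Merge two flat schemas field by field.
    static List<FieldType> merge(List<FieldType> left, List<FieldType> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("schemas differ in size");
        }
        List<FieldType> merged = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            merged.add(mergeField(left.get(i), right.get(i)));
        }
        return merged;
    }
}
```

Under these assumptions, merging [int, int, int] with [long, long, long] yields
[long, long, long], while merging an int field with a chararray field throws.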


[pi] Haven't implemented explicit cast insertion yet. This alters the plan structure, and the
existing walker might not work correctly.

[alan] Hopefully the existing walker will work in this case.  I've used
similar code before to walk plans that were being altered as they were walked,
so I expect this to work, or at least work after a little tweaking.

> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png, pigtype_cycle_check.patch, validation_part1.patch,
validation_part2.patch, validation_v2.patch
> h2. Pig Script Parser Refactor Proposal 
> This is my initial proposal on pig script parser refactor work. Please note that I need
your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logic in the parsing
stage (for example, file existence checking), which I think is not clean and makes it difficult
to add new validation rules. In the future, we will need to add more and more validation logic
to improve usability.
> *My proposal:-*  (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave the output of the parser as unchecked
logical plans. (Therefore the parser only does syntactic checking)
> - Create a new class called LogicalPlanValidationManager which is responsible for validations
of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator 
> - We can chain a set of LogicalPlanValidators inside LogicalPlanValidationManager to
do validation work. This allows a new LogicalPlanValidator to be added easily like a plug-in.

> - This greatly promotes modularity of the validation logics which  is +particularly good
when we have a lot of people working on different things+ (eg. streaming may require a special
validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement new execution
engines (For example a logical operation that one backend can do but others can't).  We can
plug-in this kind of validations on-the-fly based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor pattern.

> *Cons:-*
>  - By having every validation logic traverse the AST from the root node every time, there
is a performance hit. However I think this is negligible due to the fact that Pig is very
expressive and normally queries aren't too big (99% of queries contain no more than 1000 AST
> *Next Step:-*
> LogicalPlanFinalizer, which is also a pipeline except that each stage can modify the input
AST. This component will generally do global optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I don't think
we need
> - ASTs within the pipeline never change (read-only) so validations can be done in parallel
to improve responsiveness. But again I don't think we need this unless we have many I/O-bound
validations.
> - The same pipeline concept can also be applied in physical plan validation/optimization.
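
For reference, the validator chain in the quoted proposal might be sketched as follows.
Only the names LogicalPlanValidationManager and LogicalPlanValidator come from the proposal;
LogicalPlan's fields, the use of an interface rather than a base class, the method signatures
and SameFileValidator are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical placeholder for the unchecked plan produced by the parser.
class LogicalPlan {
    final List<String> inputFiles = new ArrayList<>();
    final List<String> outputFiles = new ArrayList<>();
}

// One validation rule; modelled as an interface here (an assumption).
interface LogicalPlanValidator {
    void validate(LogicalPlan plan, List<String> messages);
}

// Runs the registered validators in the order they were added.
class LogicalPlanValidationManager {
    private final List<LogicalPlanValidator> validators = new ArrayList<>();

    LogicalPlanValidationManager add(LogicalPlanValidator validator) {
        validators.add(validator);
        return this;
    }

    List<String> validate(LogicalPlan plan) {
        List<String> messages = new ArrayList<>();
        for (LogicalPlanValidator validator : validators) {
            validator.validate(plan, messages);
        }
        return messages;
    }
}

// Example rule from the proposal's list: do not read and write the same file.
class SameFileValidator implements LogicalPlanValidator {
    @Override
    public void validate(LogicalPlan plan, List<String> messages) {
        for (String out : plan.outputFiles) {
            if (plan.inputFiles.contains(out)) {
                messages.add("reads and writes the same file: " + out);
            }
        }
    }
}
```

Backend-specific validators would then be added to the manager at runtime, depending on the
execution engine in use.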

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
