hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-143) Proposal for refactoring of parsing logic in Pig
Date Mon, 10 Mar 2008 13:57:46 GMT

     [ https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pi Song updated PIG-143:
------------------------

    Component/s: impl
    Description: 

h2. Pig Script Parser Refactor Proposal 
This is my initial proposal for refactoring the Pig script parser. I would appreciate your
opinions on how to improve it.

*Problem*

Currently we perform validation logic in the parsing stage (for example, file existence
checking). I think this is not clean, and it makes new validation rules difficult to add.
In the future, we will need to add more and more validation logic to improve usability.

*My proposal:-*  (see [^ParserDrawing.png])
- Keep only parsing logic in the parser, and have the parser output unchecked logical plans.
(The parser therefore does only syntactic checking.)
- Create a new class called LogicalPlanValidationManager, which is responsible for validating
the AST produced by the parser.
- Each new piece of validation logic is implemented as a subclass of LogicalPlanValidator.
- We can chain a set of LogicalPlanValidators inside LogicalPlanValidationManager to do the
validation work. This allows a new LogicalPlanValidator to be added easily, like a plug-in.
- This greatly promotes modularity of the validation logic, which is +particularly good when
we have a lot of people working on different things+ (e.g. streaming may require its own
validation logic).
- We can set the execution order of the validators.
- Some backend-specific validations might be needed when we implement new execution engines
(for example, a logical operation that one backend supports but others don't). We can plug in
this kind of validation on the fly, based on the backend in use.
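
To make the chaining idea concrete, here is a minimal sketch of how the manager could run its validators in order. Only the names LogicalPlanValidationManager and LogicalPlanValidator come from the proposal; the LogicalPlan stand-in, PlanValidationException, the add method and NonEmptyPathValidator are assumptions made up for illustration, not existing Pig classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the unchecked plan produced by the parser.
final class LogicalPlan {
    final List<String> loadPaths = new ArrayList<>();
}

// Assumed exception type: thrown when a plan breaks a validation rule.
class PlanValidationException extends RuntimeException {
    PlanValidationException(String message) { super(message); }
}

// Base class that every validation rule subclasses.
abstract class LogicalPlanValidator {
    abstract void validate(LogicalPlan plan);
}

// Runs the registered validators in the order they were added.
final class LogicalPlanValidationManager {
    private final List<LogicalPlanValidator> validators = new ArrayList<>();

    // Plugging in a new rule is a single call; returns this for chaining.
    LogicalPlanValidationManager add(LogicalPlanValidator validator) {
        validators.add(validator);
        return this;
    }

    void validate(LogicalPlan plan) {
        for (LogicalPlanValidator validator : validators) {
            validator.validate(plan); // the first failure aborts the chain
        }
    }
}

// Illustrative rule only: every load path must be non-empty.
final class NonEmptyPathValidator extends LogicalPlanValidator {
    @Override
    void validate(LogicalPlan plan) {
        for (String path : plan.loadPaths) {
            if (path.isEmpty()) {
                throw new PlanValidationException("empty load path");
            }
        }
    }
}
```

With this shape, a backend-specific validator would just be one more add call made when the backend is selected, and the execution order is the registration order.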

*List of LogicalPlanValidators extracted from the current parser logic:-*

- File existence validator
- Alias existence validator

*Logic that may be added in the very near future:-*
- Streaming script test execution
- Type checking + casting promotion + type inference
- Untyped plan test execution
- Logic to prevent reading and writing from/to the same file

The common way to implement a LogicalPlanValidator will be based on the Visitor pattern.
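
As a rough illustration of a Visitor-based validator, the sketch below checks alias existence over a tiny AST. The node classes (PlanNode, LoadNode, FilterNode), the PlanVisitor interface and the error-list design are all hypothetical simplifications, not the actual Pig operator classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Visitor interface: one method per (hypothetical) node type.
interface PlanVisitor {
    void visit(LoadNode node);
    void visit(FilterNode node);
}

// Hypothetical AST node; children are the inputs of this operator.
abstract class PlanNode {
    final List<PlanNode> children = new ArrayList<>();
    abstract void accept(PlanVisitor visitor);
}

// Defines an alias, e.g. A = LOAD ...
final class LoadNode extends PlanNode {
    final String alias;
    LoadNode(String alias) { this.alias = alias; }
    @Override void accept(PlanVisitor visitor) { visitor.visit(this); }
}

// Refers to an alias, e.g. FILTER A BY ...
final class FilterNode extends PlanNode {
    final String inputAlias;
    FilterNode(String inputAlias, PlanNode input) {
        this.inputAlias = inputAlias;
        children.add(input);
    }
    @Override void accept(PlanVisitor visitor) { visitor.visit(this); }
}

// Collects an error for every reference to an alias that was never defined.
final class AliasExistenceValidator implements PlanVisitor {
    private final Set<String> defined = new HashSet<>();
    final List<String> errors = new ArrayList<>();

    // Walks inputs before the node itself, so definitions are seen first.
    void validate(PlanNode root) {
        for (PlanNode child : root.children) {
            validate(child);
        }
        root.accept(this);
    }

    @Override public void visit(LoadNode node) {
        defined.add(node.alias);
    }

    @Override public void visit(FilterNode node) {
        if (!defined.contains(node.inputAlias)) {
            errors.add("undefined alias: " + node.inputAlias);
        }
    }
}
```

The validator never modifies the nodes it visits, which matches the read-only assumption the pipeline relies on.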

*Cons:-*
 - Because every validator traverses the AST from the root node each time, there is a
performance hit. However, I think this is negligible because Pig is very expressive and
queries are normally not too big (99% of queries contain no more than 1000 AST nodes).

*Next Step:-*

The next step is a LogicalPlanFinalizer, which is also a pipeline, except that each stage can
modify the input AST. This component will generally perform global optimizations.
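
A finalizer pipeline could look roughly like the sketch below, where each stage receives a plan and returns a possibly rewritten one. Only the name LogicalPlanFinalizer comes from the proposal; representing a plan as a list of operator names, the addStage and finalizePlan methods, and the dropNoops stage are assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch: a plan is reduced to a list of operator names for illustration.
final class LogicalPlanFinalizer {
    private final List<UnaryOperator<List<String>>> stages = new ArrayList<>();

    LogicalPlanFinalizer addStage(UnaryOperator<List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Feeds the output of each stage into the next one, in registration order.
    List<String> finalizePlan(List<String> plan) {
        List<String> current = plan;
        for (UnaryOperator<List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}

final class FinalizerStages {
    // Illustrative rewrite only: drop placeholder "NOOP" operators.
    static List<String> dropNoops(List<String> plan) {
        List<String> out = new ArrayList<>(plan);
        out.removeIf("NOOP"::equals);
        return out;
    }
}
```

Unlike a validator, a stage here returns a new plan, so stage order matters for the result and not only for which error is reported first.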

*Further ideas:-*
- A composite visitor can make validation more efficient in some cases, but I don't think we
need it.
- ASTs within the pipeline never change (they are read-only), so validations can be run in
parallel to improve responsiveness. But again, I don't think we need this unless we have many
I/O-bound validators.
- The same pipeline concept can also be applied to physical plan validation/optimization.
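
If parallel validation ever became worthwhile, it could be sketched with a standard thread pool as below. This assumes validators are plain functions from a read-only plan to a list of error messages; the ParallelValidation class and validateAll method are hypothetical names, and only java.util.concurrent calls from the JDK are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelValidation {
    // Runs every validator against the same read-only plan and merges the
    // error messages in validator order.
    static <P> List<String> validateAll(P plan, List<Function<P, List<String>>> validators) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, validators.size()));
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Function<P, List<String>> validator : validators) {
                tasks.add(() -> validator.apply(plan));
            }
            List<String> errors = new ArrayList<>();
            // invokeAll returns futures in the same order as the task list.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                errors.addAll(future.get());
            }
            return errors;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("validation interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("validator failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

This only pays off when validators block on I/O (for example, file existence checks); for CPU-cheap visitors the sequential chain is simpler.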


  was:
This is a placeholder until I come up with a complete proposal. In the meantime, I definitely
need your opinions!

The basic concept is that we currently do validation logic in the parsing stage (for example,
file existence checking), which I think is not clean and makes it difficult to add new
validation rules.

What I propose, briefly:-
- Keep only parsing logic in the parser, and have the parser output unchecked logical plans.
- Create a new class called LogicalPlanValidatorManager, which is responsible for the
validation job.
- Each new piece of validation logic subclasses LogicalPlanValidator.
- We can implement chaining of LogicalPlanValidators inside LogicalPlanValidatorManager so
that a new LogicalPlanValidator can be added easily. New logic is plugged in here, so a new
LogicalPlanValidator can be implemented like a plug-in.

Here is a list of possible LogicalPlanValidators I have in mind (please add what you want):-
- The first LogicalPlanValidator to be implemented is the FileExistence validator, which comes
from the current logic we have.
- The second LogicalPlanValidator sorts out filename conflicts. (At the moment you can
save/load the same file over and over again in the same plan, which is very confusing.
Possibly we should not allow the same file name to be reused in any single plan?)
- Test run of streaming scripts before going to real execution
- Metadata checking + type system checking as mentioned in PIG-142

The common way to implement a LogicalPlanValidator is based on the Visitor pattern. Whether
this is universal for all cases or not, I need to think through more.

According to this, parsing errors will be detected first, in the parsing stage. Errors from
validations are detected in the priority order in which LogicalPlanValidators are organized
in LogicalPlanValidatorManager.

This proposal only applies to the LogicalPlan. For the PhysicalPlan, where backend-specific
validation logic is required, the same concept can be applied.



Refined version.

> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

