hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: switching to different parser in Pig
Date Tue, 25 Aug 2009 22:01:26 GMT
To answer Santhosh's question: I think the plan is to move to Jflex and CUP, but when that happens
is a matter of priorities and resources, which are not clear at this point. We do welcome contributions
;).

Olga

-----Original Message-----
From: Thejas Nair [mailto:tejas@yahoo-inc.com] 
Sent: Tuesday, August 25, 2009 12:52 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Cc: pi.songs@gmail.com
Subject: Re: switching to different parser in Pig

Jflex is covered by GPL, but code generated by it is not. Only the code that
is generated by Jflex goes into pig.jar.
We can't check Jflex.jar into svn; ivy will be set up to download it from
the Maven repository.
-Thejas



On 8/25/09 11:57 AM, "Dmitriy Ryaboy" <dvryaboy@cloudera.com> wrote:

> Santhosh,
> Am I missing something about Jflex licensing? I thought that, it being
> GPL, we can't package it with Apache-licensed software, which prevents
> it from being a viable option (regardless of technical merits).
> 
> -Dmitriy
> 
> On Tue, Aug 25, 2009 at 1:58 PM, Santhosh Srinivasan<sms@yahoo-inc.com> wrote:
>> It's been 6 months since this topic was discussed but we don't have
>> closure on it.
>> For SQL on top of Pig, we are using Jflex and CUP
>> (https://issues.apache.org/jira/browse/PIG-824). If we have decided on
>> the right parser, can we have a plan to move the other parsers in Pig to
>> the same technology?
>> 
>> Thanks,
>> Santhosh
>> 
>> PS: I am assuming we are not moving to Antlr.
>> 
>> 
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>> Sent: Tuesday, February 24, 2009 10:17 AM
>> To: pig-dev@hadoop.apache.org; pi.songs@gmail.com
>> Subject: Re: switching to different parser in Pig
>> 
>> Sorry, after I sent that email yesterday I realized I was not very
>> clear.  I did not mean to imply that antlr didn't have good
>> documentation or good error handling.  What I wanted to say was we
>> want all three of those things, and it didn't appear that antlr
>> provided all three, since it doesn't separate out scanner and parser.
>> Also, from my viewpoint, I prefer bottom up LALR(1) parsers like yacc
>> to top down parsers like javacc.  My understanding is that antlr is
>> top down like javacc.  My reasoning for this preference is that parser
>> books and classes have used those for decades, so there are a large
>> number of engineers out there (including me :) ) who know how to work
>> with them.  But maybe antlr is close enough to what we need.  I'll
>> take a deeper look at it before I vote officially on which way we
>> should go.
>> 
>> As for loops and branches, I'm not saying we need those in Pig Latin.
>> We need them somehow.  Whether it's better to put them in Pig Latin or
>> embed Pig in an existing scripting language is an ongoing debate.  I don't
>> want to make a decision now that effectively ends that debate without
>> buy in from those who feel strongly that Pig Latin should include
>> those constructs.
>> 
>> I agree with you that we should modify the logical plan to support
>> this rather than add another layer.  As for active development, the
>> only thing I'm aware of is we hope to start working on a more robust
>> optimizer for pig soon, and that will require some additional
>> functionality out of the logical operators, but it shouldn't cause any
>> fundamental architectural changes.
>> 
>> Alan.
>> 
>> 
>> On Feb 24, 2009, at 1:27 AM, pi song wrote:
>> 
>>> (1) Lack of good documentation, which makes it hard and time-consuming
>>> to learn javacc and make changes to the Pig grammar
>>> <== ANTLR is very very well documented.
>>> http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
>>> http://media.pragprog.com/titles/tpantlr/toc.pdf
>>> http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home
>>> 
>>> (2) No easy way to customize error handling and error messages
>>> <== ANTLR has very extensive error handling support
>>> http://media.pragprog.com/titles/tpantlr/errors.pdf
>>> 
>>> (3) Single path that performs both tokenizing and parsing
>>> <== What is the advantage of decoupling tokenizer and parsing ?
>>> 
>>> In addition, "Composite Grammar" is very useful for keeping the parser
>>> modular. Things that can be treated as sub-languages such as bag schema
>>> definition can be done and unit tested separately.
>>> 
>>> ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar
>>> development very efficient. Think of an IDE that helps you debug your
>>> code (which is the grammar).
>>> 
>>> One question: is there any use case for branching and loops? The current
>>> Pig is more like a query (declarative) language. I don't really see how
>>> loop constructs would fit. I think what Ted mentioned is more about
>>> embedding Pig in other languages and using those languages to do loops.
>>> 
>>> We should think about how the logical plan layer can be made simpler for
>>> external use so we don't have to introduce a new layer. Is there any major
>>> active development on it? Currently I have more spare time and should be
>>> able to help out. (BTW, I'm slow because this is just my hobby. I don't
>>> want to hold you guys up.)
>>> 
>>> Pi Song
>>> 
>>> On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia <niteshbhatia008@gmail.com> wrote:
>>> 
>>>> Hi
>>>> I got this info from javacc mailing lists. This may prove helpful:
>>>> 
>>>> 
>>>> 
>>>> ----------
>>>> -----Original Message-----
>>>> From: Ken Beesley [mailto:ken....@xrce.xerox.com]
>>>> Sent: Wednesday, August 18, 2004 2:56 PM
>>>> To: javacc
>>>> Subject: [JavaCC] Alternatives to JavaCC (was Hello All)
>>>> 
>>>> Vicas wrote:
>>>> 
>>>> Hello All
>>>> 
>>>> Kindly let me know of other available parsers which do the same job as
>>>> javacc.
>>>> 
>>>> It would be very nice of you if you can send me some documentation
>>>> related to this.
>>>> 
>>>> Thanks Vikas
>>>> 
>>>> (Corrections and clarifications to the following would be _very_
>>>> welcome. I'm very likely out of date.)
>>>> 
>>>> Of course, no two software tools are likely to do _exactly_ the same
>>>> job. Someone already pointed you to ANTLR, which is probably the
>>>> best-known alternative to JavaCC. Another possibility is SableCC.
>>>> http://sablecc.org
>>>> 
>>>> The criteria include stability, documentation, language of the parser
>>>> generated, and abstract-syntax-tree building.
>>>> 
>>>> When I last looked (a couple of years ago) at ANTLR, SableCC and
>>>> JavaCC, I chose JavaCC for the following reasons:
>>>> 
>>>> 1. ANTLR could not handle Unicode input. Things change, of course, so
>>>> ANTLR might now be more Unicode-friendly. Unicode was important to me,
>>>> so this was a big factor in my decision.
>>>> 
>>>> On the plus side for ANTLR, it has better abstract-syntax-tree
>>>> building capabilities (in my opinion) than JJTree/JavaCC. You can
>>>> learn to use JJTree commands, but it's not easy for most people.
>>>> 
>>>> And ANTLR can generate either a Java or a C++ parser. JavaCC generates
>>>> only Java parsers.
>>>> 
>>>> Another concern about ANTLR was that it was reputed to change a lot as
>>>> the guru, Terence Parr, experimented with new syntax and
>>>> functionality. JavaCC, at least at the time, was reputed to be more
>>>> stable, perhaps stable to a fault. I wanted stability and reliability.
>>>> 
>>>> 2. SableCC is much like JavaCC; it generates a Java parser from a
>>>> grammar description; but it had, in my opinion, less flexible
>>>> abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I
>>>> looked at it), the AST it built was always a direct reflection of your
>>>> grammar, generating one tree node for each grammar expansion involved
>>>> in a parse, much like using JavaCC with Java Tree Builder (JTB
>>>> http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the
>>>> alternative to using JJTree.
>>>> 
>>>> Using SableCC, or the combination JavaCC/JTB, should be _very_ similar
>>>> indeed.
>>>> 
>>>> In my opinion, SableCC and JavaCC/JTB have made a conscious choice to
>>>> simplify AST building--you get trees that reflect the expansions in
>>>> your grammar. Period. But often these default trees will be big, full
>>>> of extraneous nodes that reflect precedence hierarchies in the
>>>> recursive-descent parsing. If you want to have more control over AST
>>>> building, to get more compact and tailored ASTs, you need to pay the
>>>> price of learning JJTree.
>>>> 
>>>> Assuming that you need to build ASTs, with JavaCC you have the choice
>>>> between JJTree and JTB. With SableCC, when I last looked at it, you
>>>> only get the JTB-like option.
>>>> 
>>>> *******
>>>> 
>>>> (Again, corrections and expansions would be much appreciated.)
>>>> 
>>>> Ken Beesley
>>>> 
>>>> ----------
>>>> 
>>>> On Mon, Feb 23, 2009 at 10:06 PM, Alan Gates <gates@yahoo-inc.com> wrote:
>>>>> We looked into antlr.  It appears to be very similar to javacc, with the
>>>>> added feature that the java code it generates is humanly readable.  That
>>>>> isn't why we want to switch off of javacc.  Olga listed the 3 things we
>>>>> want out of a parser that javacc isn't giving us (good docs, easy
>>>>> customization of error handling, decoupling of scanning and parsing).  So
>>>>> antlr doesn't look viable.
>>>>> 
>>>>> In response to Pi's suggestion that we could use the logical plan, I hope we
>>>>> could use something close to it.  Whatever we choose we want it to be
>>>>> flexible enough to represent richer language constructs (like branch and
>>>>> loop).  I'm not sure our current logical plan can do that.  At the same
>>>>> time, we don't need another layer of translation (we already have logical ->
>>>>> physical -> mapreduce).  I would like to find a representation that could
>>>>> handle expressing the syntax and what is currently the logical plan.
>>>>> 
>>>>> Alan.
>>>>> 
>>>>> On Feb 20, 2009, at 5:15 PM, pi song wrote:
>>>>> 
>>>>>> Should be pretty close but we may need to clean up the interface a bit.
>>>>>> Then the new parser module can be switched in easily.
>>>>>> BTW, have we already got the solution for the new parser generator?
>>>>>> 
>>>>>> Pi
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 20, 2009 at 9:03 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Probably nearly the same effect as you suggest.  Are the concepts at the
>>>>>>> logical plan layer similar to those expressed in pig latin?  Or has a
>>>>>>> significant transformation occurred by then?
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Feb 20, 2009 at 1:59 AM, pi song <pi.songs@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Sounds good but how about exposing the logical plan layer instead?
>>>>>>>> Wouldn't that yield the same effect?  From python for example you still
>>>>>>>> can construct a logical plan and give it to Pig to execute.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Ted Dunning, CTO
>>>>>>> DeepDyve
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Nitesh Bhatia
>>>> Dhirubhai Ambani Institute of Information & Communication Technology
>>>> Gandhinagar
>>>> Gujarat
>>>> 
>>>> "Life is never perfect. It just depends where you draw the line."
>>>> 
>>>> visit:
>>>> http://www.awaaaz.com - connecting through music
>>>> http://www.volstreet.com - lets volunteer for better tomorrow
>>>> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
>>>> 
>> 
>> 
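
On the question raised earlier in the thread about what decoupling the
tokenizer from the parser buys, the following is a minimal Java sketch
under stated assumptions. It is not Pig's parser and not what Jflex or
CUP would generate: it hand-writes a tiny scanner and a separate
top-down (recursive-descent) parser for sums and products of integers.
All names here (ScannerParserSketch, scan, Parser) are hypothetical and
chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerParserSketch {
    enum Kind { NUMBER, PLUS, STAR, EOF }

    // A token carries its input position so later stages can report errors precisely.
    static final class Token {
        final Kind kind;
        final String text;
        final int pos;
        Token(Kind kind, String text, int pos) {
            this.kind = kind;
            this.text = text;
            this.pos = pos;
        }
    }

    // Stage 1: the scanner only knows about characters and produces tokens.
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Kind.NUMBER, input.substring(start, i), start));
            } else if (c == '+') {
                tokens.add(new Token(Kind.PLUS, "+", i++));
            } else if (c == '*') {
                tokens.add(new Token(Kind.STAR, "*", i++));
            } else {
                throw new IllegalArgumentException(
                    "Unexpected character '" + c + "' at position " + i);
            }
        }
        tokens.add(new Token(Kind.EOF, "", input.length()));
        return tokens;
    }

    // Stage 2: the parser only knows about tokens, never about raw characters.
    static final class Parser {
        private final List<Token> tokens;
        private int next = 0;

        Parser(List<Token> tokens) {
            this.tokens = tokens;
        }

        int parse() {
            int value = sum();
            expect(Kind.EOF);
            return value;
        }

        // sum := product ('+' product)*
        private int sum() {
            int value = product();
            while (peek().kind == Kind.PLUS) {
                next++;
                value += product();
            }
            return value;
        }

        // product := NUMBER ('*' NUMBER)*
        private int product() {
            int value = number();
            while (peek().kind == Kind.STAR) {
                next++;
                value *= number();
            }
            return value;
        }

        private int number() {
            return Integer.parseInt(expect(Kind.NUMBER).text);
        }

        private Token peek() {
            return tokens.get(next);
        }

        // Consumes the next token if it has the expected kind, else fails with its position.
        private Token expect(Kind kind) {
            Token t = peek();
            if (t.kind != kind) {
                throw new IllegalStateException(
                    "Expected " + kind + " but found " + t.kind + " at position " + t.pos);
            }
            next++;
            return t;
        }
    }

    public static void main(String[] args) {
        List<Token> tokens = scan("2 + 3 * 4");
        System.out.println(new Parser(tokens).parse());
    }
}
```

Because the two stages meet only at the token list, the scanner can be
unit tested on its own, and parse errors can cite token positions
without the parser ever touching the raw input. In principle a
generated scanner could replace scan() without changes to the parser,
though that is an assumption of this sketch rather than something
verified against Jflex or CUP.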

