lucene-dev mailing list archives

From "Luis Alves (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
Date Tue, 10 Nov 2009 23:01:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776139#action_12776139
] 

Luis Alves edited comment on LUCENE-2039 at 11/10/09 11:00 PM:
---------------------------------------------------------------

{quote}
Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more
work to do those checks. 
First step would be to build a tree using jjtree. 
Then you need to build the symbol table and then you can traverse the tree to do your checks.
{quote}

In the new queryparser we don't use jjtree, but the same concept is implemented there:
the output of the SyntaxParser interface is a syntax tree, and, just like a jjtree tree,
this tree is not tied to any Lucene objects.
But I think this is an ugly solution.

I think that if we use the new queryparser, multiple SyntaxParsers can share the same
Processors and Builders.
With a small SyntaxParser implementation (JavaCC, JFlex, ANTLR, a Java tokenizer, etc.),
you can reuse the same Processors and Builders to create a Lucene query.
This avoids duplicate code and allows for multiple syntaxes.
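
To make the idea concrete, here is a minimal sketch of two front ends sharing one processing and
building stage. Every name in it (SyntaxPipelineSketch, Node, SyntaxParser, COLON_SYNTAX,
EQUALS_SYNTAX, lowercase, build, toQuery) is hypothetical and chosen for illustration only; this is
not the actual queryparser API, and the "query" it builds is just a string standing in for a real
Lucene query object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these types are the real Lucene queryparser classes.
class SyntaxPipelineSketch {

    /** Syntax-tree node that knows nothing about Lucene query objects. */
    static final class Node {
        final String field;
        final String text;
        Node(String field, String text) { this.field = field; this.text = text; }
    }

    /** Front end: turns query text written in some syntax into syntax-tree nodes. */
    interface SyntaxParser { List<Node> parse(String query); }

    /** Syntax A: whitespace-separated "field:text" pairs. */
    static final SyntaxParser COLON_SYNTAX = q -> parsePairs(q, "\\s+", ":");

    /** Syntax B: comma-separated "field=text" pairs. */
    static final SyntaxParser EQUALS_SYNTAX = q -> parsePairs(q, "\\s*,\\s*", "=");

    static List<Node> parsePairs(String query, String pairRegex, String sep) {
        List<Node> nodes = new ArrayList<>();
        for (String pair : query.trim().split(pairRegex)) {
            int idx = pair.indexOf(sep);
            if (idx <= 0) {
                throw new IllegalArgumentException("Expected field" + sep + "text but got: " + pair);
            }
            nodes.add(new Node(pair.substring(0, idx), pair.substring(idx + sep.length())));
        }
        return nodes;
    }

    /** Shared processor: a step applied to the tree regardless of syntax (here: lower-casing). */
    static Node lowercase(Node n) {
        return new Node(n.field, n.text.toLowerCase(Locale.ROOT));
    }

    /** Shared builder: stands in for creating a real query object; here it builds a string. */
    static String build(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(n.field).append(':').append(n.text);
        }
        return sb.toString();
    }

    /** Only the parser varies; the processor and builder stages are reused as-is. */
    static String toQuery(SyntaxParser parser, String text) {
        List<Node> processed = new ArrayList<>();
        for (Node n : parser.parse(text)) processed.add(lowercase(n));
        return build(processed);
    }

    public static void main(String[] args) {
        System.out.println(toQuery(COLON_SYNTAX, "title:Lucene body:Parser"));
        System.out.println(toQuery(EQUALS_SYNTAX, "title=Lucene, body=Parser"));
    }
}
```

Both calls in main go through the same lowercase and build steps and differ only in the parser
passed in, which is the kind of reuse described above.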

I don't want to be a preacher here, but some of these problems are already solved in the new
queryparser framework; we just need to keep improving it by adding more syntaxes, extensions
and features to it.

I know the new queryparser is not in main, but that can be fixed in 3.1: if the community
thinks it is stable, we should move it there.



  
> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser has been limited to the queries living in core;
adding other queries or extending the parser in any way has always forced people to change the
grammar file and regenerate. Even if you change the grammar, you have to be extremely careful
how you modify the parser so that other parts of the standard parser are not affected by customisation
changes. In the end you still have to live with all the limitations the current parser has, like tokenizing
on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome this limitation and add regex support to the query
parser without introducing any dependency to core. I added a new special character that basically
prevents the parser from interpreting any of the characters enclosed between two occurrences of it.
I chose the forward slash '/' as the delimiter, so that everything in between two forward
slashes is basically escaped and ignored by the parser. All chars embedded within forward
slashes are treated as one token, even if it contains other special chars like * []?{} or whitespace.
This token is subsequently passed to a pluggable "parser extension" which builds a query from
the embedded string. I do not interpret the embedded string in any way but leave all the subsequent
work to the parser extension. Such an extension could be another full-featured query parser
itself or simply a ctor call for a regex query. The interface remains quite simple but makes
the parser extensible in an easy way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax,
but I guess that would not be that much of a deal, as it is reflected in the escape method.
It would truly be nice to have more than one extension and have this even more flexible,
so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of
regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better, as it would be more consistent with the idea of the query parser
being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess
I will add a second patch with regex in core soon too.
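
The extension-based approach described in the issue can be sketched roughly as follows. This is
only an illustration under stated assumptions: the names (SlashExtensionSketch, ParserExtension,
JDK_REGEX, tokenize, isExtensionToken, toMatcher) are hypothetical and not taken from the attached
patch, a Predicate over terms stands in for a real query object, and escaping a '/' inside the
delimited text is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: these names are illustrative and not taken from the LUCENE-2039 patch.
class SlashExtensionSketch {

    /** Pluggable extension: receives the raw, uninterpreted text found between two '/'. */
    interface ParserExtension { Predicate<String> build(String raw); }

    /** Example extension: hands the raw text to the JDK regex engine. */
    static final ParserExtension JDK_REGEX = raw -> {
        Pattern p = Pattern.compile(raw);
        return term -> p.matcher(term).matches();
    };

    /** Splits on whitespace, except that /.../ is kept as one token (slashes included). */
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int end;
            if (c == '/') {
                // Everything up to the next '/' belongs to this token, special chars and all.
                int close = query.indexOf('/', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated '/' at index " + i);
                }
                end = close + 1;
            } else {
                end = i;
                while (end < query.length() && !Character.isWhitespace(query.charAt(end))) end++;
            }
            tokens.add(query.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** True if the token is a delimited extension token such as /b[ao]r? *x/. */
    static boolean isExtensionToken(String token) {
        return token.length() >= 2 && token.startsWith("/") && token.endsWith("/");
    }

    /** Delimited tokens go to the extension; all other tokens become exact-match terms. */
    static Predicate<String> toMatcher(String token, ParserExtension ext) {
        if (isExtensionToken(token)) {
            return ext.build(token.substring(1, token.length() - 1));
        }
        return token::equals;
    }

    public static void main(String[] args) {
        for (String token : tokenize("foo /b[ao]r? *x/ baz")) {
            System.out.println(token + " -> extension token: " + isExtensionToken(token));
        }
    }
}
```

The tokenizer never looks inside the delimited text; interpreting it is left entirely to whichever
ParserExtension is plugged in, which mirrors the division of work described in the issue.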

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

