lucene-dev mailing list archives

From "Roman Chyla (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-5014) ANTLR Lucene query parser
Date Thu, 23 May 2013 00:11:20 GMT
Roman Chyla created LUCENE-5014:
-----------------------------------

             Summary: ANTLR Lucene query parser
                 Key: LUCENE-5014
                 URL: https://issues.apache.org/jira/browse/LUCENE-5014
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/queryparser, modules/queryparser
    Affects Versions: 4.3
         Environment: all
            Reporter: Roman Chyla


I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene
parsers are hard to extend: they are either written directly in Java (e.g. the Solr query parser
or edismax), or their parsing logic is 'married' to the query-building logic (e.g. the standard
Lucene parser, generated by JavaCC).


A few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'),
yet that parser never really caught on (it must be very confusing for many users). However,
that parsing framework is very powerful, and it is a real pity that more parsers do not
already use it, because it allows us to add, extend, or change almost any aspect of query
parsing.

So, if we combine ANTLR with queryparser.flexible, we get a very powerful framework for building
almost any query language one can think of. I hope this extension will prove useful.

The details:

 - every new query syntax is written in EBNF and lives in separate files (so it can be tested/developed
independently, using 'gunit')
 - ANTLR generates the parsing code (it can generate parsers in several languages;
the main target is Java, but it can also do Python, which may be interesting for pylucene)
 - the parser generates an AST (abstract syntax tree), which is consumed by a 'pipeline' of processors;
users can easily modify this pipeline to add the desired functionality
 - the new parser contains a few (very important) debugging functions: it can print the results
of every stage of the build and generate ASTs as graphical charts; ant targets help to build/test/debug
grammars
 - I've tried to reuse the existing queryparser.flexible components as much as possible, only
adding new processors when necessary
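
To make the 'pipeline of processors' idea concrete, here is a minimal, self-contained sketch in
Python. It is not the queryparser.flexible API and not the code from this patch: the Node class, the
two processors, and run_pipeline are all hypothetical names invented for illustration. It only shows
the general shape: each processor takes an AST and returns a new one, and the result of every stage
can optionally be printed for debugging.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Toy AST node; the label/children layout is an assumption for illustration."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def dump(self) -> str:
        """Render the tree as a LISP-style string, e.g. (AND (TERM foo))."""
        if not self.children:
            return self.label
        return "(" + " ".join([self.label] + [c.dump() for c in self.children]) + ")"


# A processor is any function that maps one AST to another.
Processor = Callable[[Node], Node]


def lowercase_terms(node: Node) -> Node:
    """Hypothetical processor: lowercase the text directly under every TERM node."""
    if node.label == "TERM":
        return Node("TERM", [Node(c.label.lower()) for c in node.children])
    return Node(node.label, [lowercase_terms(c) for c in node.children])


def drop_empty_clauses(node: Node) -> Node:
    """Hypothetical processor: remove CLAUSE nodes that have no children."""
    kept = [
        drop_empty_clauses(c)
        for c in node.children
        if not (c.label == "CLAUSE" and not c.children)
    ]
    return Node(node.label, kept)


def run_pipeline(ast: Node, processors: List[Processor], debug: bool = False) -> Node:
    """Apply each processor in order; with debug=True, print the AST after every stage."""
    for processor in processors:
        ast = processor(ast)
        if debug:
            print(f"{processor.__name__}: {ast.dump()}")
    return ast


if __name__ == "__main__":
    tree = Node("AND", [Node("TERM", [Node("Foo")]), Node("CLAUSE"), Node("TERM", [Node("BAR")])])
    run_pipeline(tree, [lowercase_terms, drop_empty_clauses], debug=True)
```

Adding, removing, or reordering entries in the list passed to run_pipeline is the toy equivalent of
modifying the pipeline to change the parser's behaviour.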

Assumptions about the grammar:
 - every grammar must have one top parse rule called 'mainQ'
 - parsers must generate an AST (Abstract Syntax Tree)

The structure of the AST is left open. Some components make assumptions about the shape of
the AST (e.g. that MODIFIER is the parent of a FIELD); however, users are free to choose or write
different processors with different assumptions about the AST shape.
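
As an illustration of what 'assumptions about the AST shape' can mean in practice, the following
Python sketch walks a toy tree and insists that every FIELD node sits directly under a MODIFIER
node. The Node class, the function name, and the convention that the first child of FIELD carries
the field name are all assumptions made for this example; they are not taken from the actual
grammar or processors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy AST node (hypothetical; not the ANTLR or Lucene tree type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def fields_under_modifier(node: Node, parent: Optional[Node] = None) -> List[str]:
    """Collect field names, raising if a FIELD is not a direct child of a MODIFIER."""
    found: List[str] = []
    if node.label == "FIELD":
        if parent is None or parent.label != "MODIFIER":
            raise ValueError("FIELD node is not a direct child of MODIFIER")
        # Assumption for this sketch: the first child of FIELD carries the field name.
        found.append(node.children[0].label)
    for child in node.children:
        found.extend(fields_under_modifier(child, node))
    return found
```

A processor written against a different tree shape would simply encode a different check, which is
why the AST structure itself can stay open.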



More documentation on how to use the parser can be seen here:

http://29min.wordpress.com/category/antlrqueryparser/


The parser was created more than a year ago and is used in production (http://labs.adsabs.harvard.edu/adsabs/).
Different dialects of query languages (with proximity operators, functions, special logic,
etc.) can be seen here:

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
https://github.com/romanchyla/montysolr/tree/master/contrib/invenio




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

