lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Surround query parser
Date Sun, 18 Apr 2004 12:51:45 GMT
Dear developers,

I'd like to contribute a query parser named Surround.

The implementation uses mostly Lucene's BooleanQuery, TermQuery,
SpanNearQuery, SpanOrQuery and SpanTermQuery. These are chosen
depending on the query operator.

Currently the sources are in a CVS working copy next to a lucene
working copy. There is some test code which uses the latest
lucene jar generated from the lucene working copy.

The source code has cooled down far enough for a
package restructuring. In case there is interest, how would
the sources best be structured? Currently the sources use two
packages: org.surround.queryparser and
org.surround.search.
Following the name of org.apache.lucene.wordnet in the sandbox,
would org.apache.lucene.surround be OK?


Regards,
Paul


P.S.:

Surround consists of these operators (uppercase/lowercase):

AND/OR/NOT/nW/nN    as infix and
AND/OR/nW/nN        as prefix.

Distance operators W and N have default n=1, max 99.
Implemented as ordered/unordered SpanQuery with slop = (n - 1).
An example prefix form is:

20N(aa*, bb*, cc*)

The name Surround was chosen because of this prefix form
and because it uses the newly introduced span queries
to implement the proximity operators.
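
To make the distance operator rule above concrete (default n=1, max 99,
W ordered, N unordered, slop = n - 1), here is a minimal sketch using
only the JDK. The class DistanceOperator and its methods are hypothetical
names chosen for illustration; they are not taken from the Surround
sources, and the real parser is generated by javacc from the .jj file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Surround sources. */
final class DistanceOperator {
    // Optional distance of one or two digits (so at most 99),
    // followed by W (ordered) or N (unordered), in either case.
    private static final Pattern OPERATOR =
            Pattern.compile("([0-9]{1,2})?([WwNn])");

    final int distance;     // the n in nW / nN
    final boolean ordered;  // true for W, false for N

    private DistanceOperator(int distance, boolean ordered) {
        this.distance = distance;
        this.ordered = ordered;
    }

    static DistanceOperator parse(String text) {
        Matcher m = OPERATOR.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a distance operator: " + text);
        }
        // A missing distance means the default n = 1.
        int n = (m.group(1) == null) ? 1 : Integer.parseInt(m.group(1));
        if (n < 1) {
            throw new IllegalArgumentException("Distance must be 1..99: " + text);
        }
        return new DistanceOperator(n, m.group(2).equalsIgnoreCase("W"));
    }

    /** Slop to hand to a span near query: n - 1. */
    int slop() {
        return distance - 1;
    }
}
```

With this sketch, DistanceOperator.parse("20N") gives an unordered
operator with slop 19, and a bare "W" gives an ordered one with slop 0.
In Lucene these two values would correspond to the slop and inOrder
arguments of SpanNearQuery.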

The operators and their prefix and infix
forms were borrowed from the user documentation of
various other query languages on the internet.

Wildcards/truncations are the same as in the
Lucene standard query parser:
* for internal and suffix truncation,
? to match one character.

And there is:
^ for boosting a term or a bracketed subquery.


Some examples (best read with fixed size font):

aa
aa and bb
aa and bb or cc        same effect as:  (aa and bb) or cc
aa NOT bb NOT cc       same effect as:  (aa NOT bb) NOT cc

and(aa,bb,cc)          aa and bb and cc
99w(aa,bb,cc)          ordered span query with slop 98
99n(aa,bb,cc)          unordered span query with slop 98

20n(aa*,bb*)
3w(a?a or bb?, cc*)    W subqueries: OR, truncation

title: text: aa
title : text : aa or bb
title:text: aa not bb
title:aa not text:bb

cc 3w dd               infix: dual.

cc N dd N ee           same effect as:   (cc N dd) N ee

text: aa 3n bb         same effect as:    text: (aa 3n bb)



Development status

Not tested: multiple fields, internally mapped to OR queries.

Suffix truncation is implemented very similarly to Lucene's PrefixQuery.

Wildcards (? and internal *) are implemented with regular expressions
to allow further variations. A reimplementation using Lucene's
WildCardTermEnum (correct name?) should be no problem.
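
As an illustration of the regular expression approach, a wildcard term
could be translated to a java.util.regex pattern roughly as follows.
This is only a sketch under my own assumptions: the class name
WildcardRegex and the method toPattern are hypothetical, and it treats
every * the same way instead of handling suffix truncation separately.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Surround sources. */
final class WildcardRegex {

    /** Translates '?' to "any one character" and '*' to "any characters". */
    static Pattern toPattern(String term) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                appendLiteral(regex, literal);
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        appendLiteral(regex, literal);
        return Pattern.compile(regex.toString());
    }

    // Quote the collected plain characters so that regex metacharacters
    // in a term (for example '.') are matched literally.
    private static void appendLiteral(StringBuilder regex, StringBuilder literal) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }
}
```

Each indexed term would then be tested with
toPattern("a?a").matcher(termText).matches(), which is where further
variations of the wildcard syntax could be added.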

There is a warning for ordered subqueries with 3 or more subqueries,
due to a pending bug in the ordered SpanNearQuery.

Warnings about missing terms are sent to System.out; this could
be replaced by another stream.
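
One way to make the warning stream replaceable is to pass a PrintStream
in, with System.out only as the default. The sketch below assumes a
hypothetical class WarningSink and a made-up message format; neither
exists in the current sources.

```java
import java.io.PrintStream;

/** Hypothetical helper for illustration; not part of the Surround sources. */
final class WarningSink {
    private final PrintStream out;

    WarningSink(PrintStream out) {
        this.out = out;
    }

    /** Keeps the current behaviour: warnings go to System.out. */
    static WarningSink toStandardOut() {
        return new WarningSink(System.out);
    }

    /** Reports a query term that matched nothing in the index. */
    void missingTerm(String field, String term) {
        out.println("Warning: no terms found for " + field + ":" + term);
    }
}
```

Test code could then pass a PrintStream wrapped around a
ByteArrayOutputStream and check the warnings it collected.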

There are no javadoc comments.
I'm using Java 1.4.2, so there are probably some dependencies
on Java 1.4.
Other tools used: Ant 1.6b2 and JavaCC 3.2.
The build target javacc should be used explicitly
when the .jj file is changed.

The sources, apart from a build.xml file:

... src/java/org/surround/search> wc *.java ../q*/*.jj | sort -r

   1424    4322   40776 total
    436    1404   11140 ../queryparser/QueryParser.jj
    138     484    4582 SpanNearClauseFactory.java
    106     316    3359 DistanceQuery.java
    101     266    2860 ComposedQuery.java
     96     245    2480 SrndTruncQuery.java
     95     266    2994 SimpleTerm.java
     78     245    2390 FieldsQuery.java
     72     218    2044 SrndQuery.java
     60     151    1613 SrndPrefixQuery.java
     52     132    1378 SrndTermQuery.java
     49     158    1446 BasicQueryFactory.java
     46     130    1412 OrQuery.java
     31      80     826 SrndBooleanQuery.java
     22      79     866 NotQuery.java
     16      54     569 AndQuery.java
     15      59     512 DistanceSubQuery.java
     11      35     305 TooManyBasicQueries.java

And the test code:

... /src/test/org/surround/search> wc *.java | sort -r

    550    1963   16899 total
    203     875    6761 Test03Distance.java
    105     444    3582 Test02Boolean.java
     97     272    2805 BooleanQueryTest.java
     55     144    1528 ExceptionQueryTest.java
     51     121    1072 Test01Exceptions.java
     39     107    1151 SingleFieldTestDb.java

