lucene-java-user mailing list archives

From "Mark Miller" <markrmil...@gmail.com>
Subject Re: Test new query parser?
Date Mon, 21 Aug 2006 20:37:47 GMT
Great, I will get something ready to be given out within a day or so then.

Paragraph/Sent prox support is one thing I really need to test and improve.

The paragraph and sentence search uses a SpanWithinQuery. This is just a
SpanNotQuery that can span the excluded marker a specified number of times
instead of not at all.

mark ~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, false),
3, allFields:¶)

mark ord~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, true), 3,
allFields:¶)
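
To make the "within" idea above concrete, here is a minimal standalone sketch
of the check in plain Java. It is not the SpanWithinQuery code and uses no
Lucene classes; the class and method names are hypothetical, and it assumes
"within 3 paragraphs" means a match may cross at most 3 marker positions.

```java
// Hypothetical illustration only; not the SpanWithinQuery implementation.
class WithinSketch {

    /** Counts marker positions that fall inside the span [start, end). */
    static int markersInside(int start, int end, int[] markerPositions) {
        int count = 0;
        for (int pos : markerPositions) {
            if (pos >= start && pos < end) {
                count++;
            }
        }
        return count;
    }

    /**
     * True when the span [start, end) crosses at most maxMarkers markers.
     * With maxMarkers == 0 this behaves like a plain "not" exclusion.
     * ("At most" is an assumption about the intended semantics.)
     */
    static boolean within(int start, int end, int maxMarkers, int[] markerPositions) {
        return markersInside(start, end, markerPositions) <= maxMarkers;
    }

    public static void main(String[] args) {
        // Made-up token positions of paragraph markers in one document.
        int[] paragraphMarks = {5, 12, 20, 31};
        System.out.println(within(3, 22, 3, paragraphMarks)); // crosses 5, 12, 20
        System.out.println(within(3, 35, 3, paragraphMarks)); // crosses all four
    }
}
```

Here the near match supplies the span start and end, and the marker term
supplies the positions to count.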

I chose 99999 quite arbitrarily; Integer.MAX_VALUE blows things up.

This example uses ¶ as a marker. I think a better default might be a double
newline of some kind. Unfortunately, the marker is not nicely configurable
without source code access because it must be defined in the .jj file of the
modified standard analyzer. Sentences can be deduced more universally:

Sentences are matched with:

<SENTENCE: ([".","?","!","]","[","]","\"","'",")"])+ ([" ","\t","\r","\n"])+
(["[","\"","'","`","("])* >

which seems to be the standard sentence finder, although it lacks the
ability to check that an alphanumeric comes next (so that it doesn't mark a
sentence at the end of the text). No biggie for now though.
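
For anyone who wants to experiment outside the .jj grammar, here is a rough
java.util.regex sketch loosely modeled on the token above. It is not the
analyzer's code: the class name is hypothetical, the character sets are an
approximation, and the trailing lookahead for a letter or digit is an added
assumption showing one way to do the "alphanumeric comes next" check.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the modified standard analyzer.
class SentenceBoundarySketch {

    // Closing punctuation, then whitespace, then optional opening punctuation,
    // and finally a lookahead requiring a letter or digit to follow.
    static final Pattern BOUNDARY = Pattern.compile(
            "[.?!\\]\"')]+[ \\t\\r\\n]+[\\[\"'`(]*(?=[\\p{L}\\p{N}])");

    /** Counts sentence boundaries found in the text. */
    static int countBoundaries(String text) {
        Matcher m = BOUNDARY.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The trailing ". " is not counted because nothing alphanumeric follows.
        System.out.println(countBoundaries("One. Two. "));
        System.out.println(countBoundaries("Is it done? Yes. \"Good.\" "));
    }
}
```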

Also, right now paragraphs and sentences are marked separately at all
spots...for compactness a paragraph marker should also represent a sentence
marker. I need to check speed on these things though.

With some others interested in this, I will get a move on these issues.
I really want some feedback to make this thing better. It is not
perfect, but I believe it has potential. Adding the other span type query,
for example, would not be very difficult. The syntax is pretty easily
modified and extended.

- Mark
