lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: Sentence boundary storage
Date Sat, 29 Oct 2005 00:08:49 GMT

: One thing that I know has bugged me is when matching a phrase where I
: would expect a mathematical formula (which is "just a subphrase"). I
: would have liked the phrase-query to extend as far as it wishes but not
: past a given token... would this be possible?
: Presumably a period token and this feature would have provided the same?

I haven't tried it myself, but my reading of SpanQueries leads me to
believe you could accomplish what you want (and what Grant describes) by
inserting special Terms to denote
formula/sentence/paragraph/section/chapter boundaries, and then using
SpanNearQueries with a high slop in conjunction with a
SpanNotQuery using a SpanTermQuery for the boundary you don't want to
cross (or a SpanOrQuery containing many SpanTermQueries for the list of
boundaries you don't want to cross).
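
To make the idea concrete, here is a rough plain-Java model of that matching logic. It is not Lucene code and I haven't run it against an index: the class name, the method names and the "_SENT_" marker are all made up for illustration. It only mimics what a high-slop SpanNearQuery wrapped in a SpanNotQuery on a boundary term should do, assuming the boundary marker occupies its own token position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "near, but not across a boundary". Hypothetical names throughout.
final class BoundarySpanSketch {
    // Hypothetical marker the analyzer is assumed to emit at each sentence end.
    static final String SENTENCE_END = "_SENT_";

    // True if any token in [start, end) is the boundary marker.
    static boolean containsBoundary(List<String> tokens, int start, int end, String boundary) {
        for (int p = start; p < end; p++) {
            if (tokens.get(p).equals(boundary)) return true;
        }
        return false;
    }

    // Returns [start, end) spans where a and b occur (in either order) with at most
    // `slop` positions between them and no boundary marker inside the span.
    static List<int[]> nearWithoutCrossing(List<String> tokens, String a, String b,
                                           int slop, String boundary) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (i == j || !tokens.get(j).equals(b)) continue;
                int start = Math.min(i, j);
                int end = Math.max(i, j) + 1;
                if (end - start - 2 > slop) continue;           // the "near" part
                if (containsBoundary(tokens, start, end, boundary)) continue; // the "not" part
                hits.add(new int[] {start, end});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "energy", "equals", "mass", SENTENCE_END, "times", "speed");
        System.out.println(nearWithoutCrossing(tokens, "energy", "mass", 10, SENTENCE_END).size());
        System.out.println(nearWithoutCrossing(tokens, "mass", "times", 10, SENTENCE_END).size());
    }
}
```

The slop can be as generous as you like; the boundary check is what keeps a match from leaking into the next sentence. For several boundary types you would test against a set of markers instead of one, which is the role the SpanOrQuery plays.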

If you get your Tokenizer to put the special boundary terms at the exact
same position as the token they mark, regular PhraseQueries should still
work fine without needing any special slop, and you could do things like
"find me this phrase near the beginning of a sentence" or "find me
this phrase near the end of a chapter".
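
Here is a similarly rough plain-Java model of the same-position variant. Again, this is not Lucene code and the names (StackedBoundarySketch, "_SENT_START_", the method names) are invented for the example. I'm assuming the marker is stacked on the first token of each sentence, which in Lucene terms would presumably mean emitting it with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of boundary markers that share a position with a real token.
final class StackedBoundarySketch {
    // Hypothetical marker stacked on the first token of every sentence.
    static final String SENTENCE_START = "_SENT_START_";

    // One set of terms per position; the marker adds no extra position.
    static List<Set<String>> index(List<List<String>> sentences) {
        List<Set<String>> positions = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                Set<String> termsHere = new HashSet<>();
                termsHere.add(sentence.get(i));
                if (i == 0) termsHere.add(SENTENCE_START);
                positions.add(termsHere);
            }
        }
        return positions;
    }

    // Exact phrase match on consecutive positions, i.e. no slop needed.
    static List<Integer> phraseStarts(List<Set<String>> positions, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.isEmpty()) return starts;
        for (int s = 0; s + phrase.size() <= positions.size(); s++) {
            boolean match = true;
            for (int k = 0; k < phrase.size() && match; k++) {
                match = positions.get(s + k).contains(phrase.get(k));
            }
            if (match) starts.add(s);
        }
        return starts;
    }

    // True if some phrase match starts within maxDistance positions after a sentence start.
    static boolean phraseNearSentenceStart(List<Set<String>> positions,
                                           List<String> phrase, int maxDistance) {
        for (int s : phraseStarts(positions, phrase)) {
            for (int p = s; p >= 0 && s - p <= maxDistance; p--) {
                if (positions.get(p).contains(SENTENCE_START)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> positions = index(Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("a", "quick", "brown", "fox", "jumps")));
        System.out.println(phraseNearSentenceStart(positions, Arrays.asList("quick", "brown"), 1));
        System.out.println(phraseNearSentenceStart(positions, Arrays.asList("fox", "jumps"), 1));
    }
}
```

Because the marker takes up no position of its own, the ordinary token positions are unchanged and an exact phrase lookup behaves as it would without markers. The trade-off is that a phrase can still match across a sentence break unless you also exclude the marker.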

: > Was wondering what people's experience is with storing sentence (or
: > other) boundary information in Lucene.  For instance, for phrase
: > queries, you may not want to match when two terms lie on either side
: > of a sentence boundary.  I know for phrase queries the common approach
: > is to make the position increment larger than one, which solves that
: > immediate problem, but I have other uses for such information, too.
: > Should I just store some type of boundary marker at the appropriate
: > position and check to see if I have a boundary marker when doing my
: > processing?  I know I need an Analyzer that can detect the boundaries,
: > for starters.  What other issues have people run up against?

