lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: context and hit positions with Lucene
Date Mon, 08 Oct 2001 17:25:46 GMT
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> 
> > The index does not store the byte-position of words in the original
> > document.
> 
> Does that rule out the potential to implement proximity operators? I need
> to implement NEAR (and then SAME for paragraph searches), but I'm a novice
> in terms of search engine implementations. Am I likely to be out of my
> depth attempting that right now with Lucene?

Lucene does not directly support paragraph-based searching.

Lucene does support proximity searches, e.g., exact phrases, and within-N
words (slop).  Please see the documentation for PhraseQuery, especially the
setSlop(int) method:
 
http://jakarta.apache.org/lucene/api/org/apache/lucene/search/PhraseQuery.html

Phrase slop is thus essentially WITHIN.  The QueryParser class does not yet
have a syntax to specify slop.
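
To make the within-N idea concrete, here is a minimal, self-contained sketch.
It does not use Lucene at all: WithinSketch, positions() and within() are
hypothetical names, the tokenizer is a plain whitespace split, and only the
ordered two-term case is handled. Lucene's own slop matching in PhraseQuery
is more general, so treat this as an illustration of the concept only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: a simplified "within N words"
// test for two terms over whitespace-split, lower-cased tokens.
class WithinSketch {

    // Returns the 0-based word positions at which term occurs in text.
    // The term is expected to be lower case already.
    static List<Integer> positions(String text, String term) {
        List<Integer> result = new ArrayList<>();
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if some occurrence of b follows some occurrence of a with at
    // most `slop` other words in between (slop 0 means the phrase "a b").
    static boolean within(String text, String a, String b, int slop) {
        for (int pa : positions(text, a)) {
            for (int pb : positions(text, b)) {
                int gap = pb - pa - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps";
        System.out.println(within(doc, "quick", "fox", 0)); // false
        System.out.println(within(doc, "quick", "fox", 1)); // true
    }
}
```

With a slop of 0 the two words must be adjacent; a slop of 1 tolerates one
intervening word, which is why "quick" and "fox" match only in the second
call.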

> > Perhaps we should add a utility method such as:
> >
> >   public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)

> This looks good, but what about the (future) case where you have complex
> (possibly nested) proximity searches and only want to highlight the
> relevant tokens when they appear near each other?

As you point out, the method I suggest would highlight isolated occurrences
of terms from query phrases in hit documents, even when they do not occur in
phrases.  (Note that for the document to be a hit, they will somewhere also
occur together in a phrase, and possibly quite frequently for a high-scoring
hit.)  Google and most other search engines implement term highlighting this
way, and I think it is acceptable.  One could of course write a
TokenStream-based query evaluator that correctly interpreted phrasal
restrictions when highlighting.  Personally, I do not think it is worth the
effort, so I am not volunteering to do it myself.
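
For illustration, the simple per-term approach could look roughly like the
sketch below. It is an assumption-laden stand-in, not the proposed Lucene
method: it drops the Analyzer argument and tokenizes with a \w+ regular
expression instead, and HitTokensSketch and its Token class are hypothetical
names. It returns every token whose lower-cased text is in the query term
set, together with character offsets, whether or not the token is part of a
phrase.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; a regex tokenizer stands in for a Lucene Analyzer.
class HitTokensSketch {

    // Minimal stand-in for an analyzer token: the lower-cased term plus
    // its character offsets [start, end) in the original text.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Returns, in document order, every token whose term is a query term.
    // Phrase restrictions are deliberately ignored.
    static List<Token> getHitTokens(Set<String> queryTerms, Reader text) {
        StringBuilder sb = new StringBuilder();
        try {
            char[] buf = new char[4096];
            int n;
            while ((n = text.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Token> hits = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(sb);
        while (m.find()) {
            String term = m.group().toLowerCase();
            if (queryTerms.contains(term)) {
                hits.add(new Token(term, m.start(), m.end()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("quick", "fox"));
        Reader text = new StringReader("A quick dog, then a Quick brown fox.");
        for (Token t : getHitTokens(terms, text)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

Note that the isolated "quick" in "A quick dog" is reported alongside the
later "Quick brown fox", which is exactly the behavior discussed above.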

Doug
