lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: amusing interaction between advanced tokenizers and highlighter package
Date Sat, 19 Jun 2004 09:16:37 GMT
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
> A naive analyzer would turn something like "SyncThreadPool" into one 
> token. Mine uses the great Lucene capability of Tokens being able to 
> have a "0" position increment to turn it into the token stream:
> Sync   (incr = 0)
> Thread (incr = 0)
> Pool (incr = 0)
> SyncThreadPool (incr = 1)
> [As an aside maybe it should also pair up the subtokens, so 
> "SyncThread" and "ThreadPool" appear too].
> The point behind this is someone searching for "threadpool" probably 
> would want to see a match for "SyncThreadPool" even though this is the evil 
> leading-prefix case. With most other Analyzers and ways of forming a 
> query this would be missed, which I think is anti-human and annoys me 
> to no end.

There are indexing/querying solutions/workarounds to the leading-prefix 
issue, such as reversing the text as you index it and ensuring you do 
the same on queries so they match.  There are some interesting 
techniques for this type of thing in the Managing Gigabytes book I'm 
currently reading, which Lucene could support with custom analysis and 
queries, I believe.
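
As a rough illustration of the reversal idea, here is a minimal sketch in plain Java with no Lucene dependency. The class `ReversedTermIndex` and its methods are hypothetical names invented for this example, and the `TreeSet` only stands in for a sorted term dictionary. The point is that a "terms ending in X" lookup becomes an ordinary prefix scan once every term is stored reversed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary that stores every term reversed.
class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<String>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Index time: lowercase the term and store it reversed.
    void add(String term) {
        reversedTerms.add(reverse(term.toLowerCase(Locale.ROOT)));
    }

    // Query time: reverse the suffix the same way, then do a prefix scan.
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix.toLowerCase(Locale.ROOT));
        List<String> hits = new ArrayList<String>();
        // tailSet starts at the first term >= prefix; all terms sharing the
        // prefix are contiguous in sorted order, so stop at the first miss.
        for (String t : reversedTerms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            hits.add(reverse(t)); // un-reverse for display
        }
        return hits;
    }
}
```

With "SyncThreadPool", "ThreadPool" and "Thread" added, `endingWith("threadpool")` finds the first two and skips the third. In a real index the same reversal would have to be applied by the analyzer at both index and query time.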

> The problem is as follows. In all cases I use my Analyzer to index the 
> documents.
> If I use my Analyzer with the Highlighter package, it doesn't 
> look at the position increment of the tokens and consequently a 
> nonsense stream of matches is output. If I use a different Analyzer w/ 
> the highlighter (say, the StandardAnalyzer), then it doesn't show the 
> matches that really matched, as it doesn't see the "subtokens".

Are your "subtokens" marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.
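
To make the offset question concrete, here is a small sketch in plain Java (not the Lucene Token API) of what subtokens with correct offsets could look like. `CamelCaseSketch` and `Tok` are hypothetical names. The sketch assumes the whole word is emitted first with a position increment of 1 and the parts are stacked on it with an increment of 0, which may not be the order David's analyzer uses. Each part keeps the start and end offsets of its own characters in the original text.

```java
import java.util.ArrayList;
import java.util.List;

class CamelCaseSketch {
    // Hypothetical token: text, character offsets [start, end), position increment.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posIncr;

        Tok(String text, int start, int end, int posIncr) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncr = posIncr;
        }
    }

    // Splits a word at uppercase letters. baseOffset is where the word starts
    // in the document, so every offset points back into the original text.
    static List<Tok> split(String word, int baseOffset) {
        List<Tok> out = new ArrayList<Tok>();
        // Whole word opens a new position (increment 1).
        out.add(new Tok(word, baseOffset, baseOffset + word.length(), 1));
        int start = 0;
        for (int i = 1; i <= word.length(); i++) {
            boolean boundary = i == word.length() || Character.isUpperCase(word.charAt(i));
            if (!boundary) {
                continue;
            }
            // Skip the part if it is just the whole word again.
            if (!(start == 0 && i == word.length())) {
                // Parts share the whole word's position (increment 0).
                out.add(new Tok(word.substring(start, i), baseOffset + start, baseOffset + i, 0));
            }
            start = i;
        }
        return out;
    }
}
```

For "SyncThreadPool" at offset 0 this yields the whole word at [0, 14) followed by Sync [0, 4), Thread [4, 10) and Pool [10, 14). The split rule is deliberately naive: a run of capitals such as "XML" would be broken into single letters.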

> It might be the fix is for the Highlighter to look at the position 
> increment of tokens and only pass by one if multiple ones have an incr 
> of 0 and match one part of the query.
> Has this come up before and is the issue clear?
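
The proposed fix could be sketched roughly as follows, in plain Java rather than against the actual Highlighter code. `StackedTokenHighlighter` and `Tok` are hypothetical names, and the sketch assumes tokens arrive in position order, that a positive increment opens a new position, and that offsets of different positions do not overlap. Tokens sharing a position are treated as one group, and at most one matching token per group is highlighted (here, the longest match).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StackedTokenHighlighter {
    // Hypothetical token: text, character offsets [start, end), position increment.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posIncr;

        Tok(String text, int start, int end, int posIncr) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncr = posIncr;
        }
    }

    // queryTerms are expected to be lowercase already.
    static String highlight(String text, List<Tok> tokens, Set<String> queryTerms) {
        List<Tok> picked = new ArrayList<Tok>();
        Tok best = null; // best match within the current position
        for (Tok t : tokens) {
            if (t.posIncr > 0) {
                // A new position starts: flush the previous group's winner.
                if (best != null) {
                    picked.add(best);
                }
                best = null;
            }
            boolean matches = queryTerms.contains(t.text.toLowerCase(Locale.ROOT));
            if (matches && (best == null || t.end - t.start > best.end - best.start)) {
                best = t;
            }
        }
        if (best != null) {
            picked.add(best);
        }

        // Rebuild the text, wrapping each picked span exactly once.
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Tok t : picked) {
            out.append(text, cursor, t.start);
            out.append("<B>").append(text, t.start, t.end).append("</B>");
            cursor = t.end;
        }
        out.append(text.substring(cursor));
        return out.toString();
    }
}
```

One consequence of picking a single token per position: a query matching both "sync" and "pool" would highlight only one of them inside "SyncThreadPool". Merging the matching offset ranges instead would avoid that, at the cost of a little more bookkeeping.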

The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  Also, it doesn't currently handle the new SpanQuery 
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it would be great 
to have your contribution rolled back in :)

