lucene-java-user mailing list archives

From David Spencer
Subject Re: amusing interaction between advanced tokenizers and highlighter package
Date Sat, 19 Jun 2004 16:35:46 GMT
Erik Hatcher wrote:

> On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
>> A naive analyzer would turn something like "SyncThreadPool" into one 
>> token. Mine uses the great Lucene capability of Tokens being able to 
>> have a "0" position increment to turn it into the token stream:
>> Sync   (incr = 0)
>> Thread (incr = 0)
>> Pool (incr = 0)
>> SyncThreadPool (incr = 1)
>> [As an aside maybe it should also pair up the subtokens, so 
>> "SyncThread" and "ThreadPool" appear too].
>> The point behind this is someone searching for "threadpool" probably 
>> would want to see a match for "SyncThreadPool" even though this is the evil 
>> leading-prefix case. With most other Analyzers and ways of forming a 
>> query this would be missed, which I think is anti-human and annoys me 
>> to no end.
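
To make the token stream above concrete, here is a minimal sketch in plain Java with no Lucene dependency. `SubToken` and `CamelCaseSplitter` are made-up names for illustration, not Lucene classes, and the rule of splitting at each lowercase-to-uppercase boundary is an assumption about how the analyzer works. The tokens come out in the order listed in the message: sub-tokens with an increment of 0, then the whole word with an increment of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Token: only the text and its position increment.
final class SubToken {
    final String text;
    final int positionIncrement;

    SubToken(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
        return text + " (incr = " + positionIncrement + ")";
    }
}

final class CamelCaseSplitter {
    // Splits at every lowercase-to-uppercase boundary (an assumed rule).
    static List<String> splitCamelCase(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i - 1))
                    && Character.isUpperCase(word.charAt(i))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        if (start < word.length()) {
            parts.add(word.substring(start));
        }
        return parts;
    }

    // Emits the sub-tokens (incr = 0) followed by the full word (incr = 1),
    // mirroring the stream described in the message.
    static List<SubToken> tokenize(String word) {
        List<SubToken> out = new ArrayList<>();
        List<String> parts = splitCamelCase(word);
        if (parts.size() > 1) {
            for (String part : parts) {
                out.add(new SubToken(part, 0));
            }
        }
        out.add(new SubToken(word, 1));
        return out;
    }
}
```

A word with no internal capital, such as "pool", yields only the single normal token.
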
> There are indexing/querying solutions/workarounds to the 
> leading-prefix issue, such as reversing the text as you index it and 
> ensuring you do the same on queries so they match.  There are some 
> interesting techniques for this type of thing in the Managing 
> Gigabytes book I'm currently reading, which Lucene could support with 
> custom analysis and queries, I believe.
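
The reversal idea can be sketched in a few lines of plain Java. This is only an illustration: `ReversedTerms` is a made-up name, and the lowercasing step is an assumption that neither message states. Reversing both the indexed term and the query turns a "leading-prefix" (suffix) match into an ordinary prefix match.

```java
import java.util.Locale;

final class ReversedTerms {
    // Lowercases (an assumption) and reverses a term; apply this at index time
    // and again at query time so both sides agree.
    static String normalize(String term) {
        return new StringBuilder(term.toLowerCase(Locale.ROOT)).reverse().toString();
    }

    // A suffix match on the original term becomes a prefix match on the reversed term.
    static boolean matchesSuffix(String indexedReversedTerm, String querySuffix) {
        return indexedReversedTerm.startsWith(normalize(querySuffix));
    }
}
```

With this, `normalize("SyncThreadPool")` is matched by the query "threadpool" but not by "sync", since only suffixes of the original term become prefixes of the reversed one.
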

Yeah, great book. I thought my approach fit into Lucene the most 
naturally for my goals - and no doubt, things like just having the 
possibility of different pos increments is a great concept that I 
haven't seen in other search engines. I keep meaning to try an idea that 
appeared on the list some months ago, bumping up the incr between 
sentences so that it's harder for, say, a 2 word phrase to match w/ 1 
word in each sentence (makes sense to a computer, but usually not what a 
human wants).  Another side project...
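
A rough sketch of the sentence-gap idea, again in plain Java without Lucene. `SentenceGap` is a made-up name, the gap size is whatever the caller picks, and detecting sentence ends by trailing punctuation is a deliberately naive assumption. The first word after a sentence end gets a large position increment, so an exact two-word phrase (positions differing by 1) cannot straddle the boundary.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceGap {
    // Returns the absolute position of each whitespace-separated word.
    // A word that follows sentence-ending punctuation is moved ahead by
    // sentenceGap instead of 1.
    static List<Integer> positions(String text, int sentenceGap) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        boolean previousEndedSentence = false;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for empty input
            }
            position += previousEndedSentence ? sentenceGap : 1;
            positions.add(position);
            previousEndedSentence =
                    word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
        }
        return positions;
    }
}
```

For "The pool is full. Threads wait." with a gap of 10, "full." lands at position 3 and "Threads" at 13, so a phrase query for "full threads" would no longer match across the two sentences.
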

>> The problem is as follows. In all cases I use my Analyzer to index 
>> the documents.
>> If I use my Analyzer with the Highlighter package, it doesn't 
>> look at the position increment of the tokens and consequently a 
>> nonsense stream of matches is output. If I use a different Analyzer 
>> w/ the highlighter (say, the StandardAnalyzer), then it doesn't show 
>> the matches that really matched, as it doesn't see the "subtokens".
> Are your "subtokens" marked with correct offset values?  This probably 
> doesn't relate to the problem you're seeing, but I'm curious.

I think so but this is the first time I've done this kind of thing. When 
I hit the special case several of the "subtokens" are 1st returned w/ an 
incr of 0, then the normal token, w/ an incr of 1 - which seems to make 
sense to me at least.
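
One way to sanity-check the offsets is to confirm that each token's start and end offsets slice its own text out of the original document. The sketch below is plain Java, not the Lucene API; `OffsetToken` and `OffsetCheck` are made-up names, and the camel-case split rule is an assumption. Each sub-token should point at its own slice of the word, while the full token spans the whole word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token carrying character offsets into the original document.
final class OffsetToken {
    final String text;
    final int start;             // inclusive
    final int end;               // exclusive
    final int positionIncrement;

    OffsetToken(String text, int start, int end, int positionIncrement) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.positionIncrement = positionIncrement;
    }
}

final class OffsetCheck {
    // Tokenizes the word occupying doc[start, end): sub-tokens (incr = 0)
    // first, then the whole word (incr = 1), each with its own offsets.
    static List<OffsetToken> subTokens(String doc, int start, int end) {
        List<OffsetToken> out = new ArrayList<>();
        int partStart = start;
        for (int i = start + 1; i < end; i++) {
            if (Character.isLowerCase(doc.charAt(i - 1))
                    && Character.isUpperCase(doc.charAt(i))) {
                out.add(new OffsetToken(doc.substring(partStart, i), partStart, i, 0));
                partStart = i;
            }
        }
        if (partStart > start) {
            // At least one split happened, so emit the trailing part too.
            out.add(new OffsetToken(doc.substring(partStart, end), partStart, end, 0));
        }
        out.add(new OffsetToken(doc.substring(start, end), start, end, 1));
        return out;
    }

    // True when the token's offsets slice exactly its own text out of the document.
    static boolean offsetsMatch(String doc, OffsetToken token) {
        return doc.substring(token.start, token.end).equals(token.text);
    }
}
```

If `offsetsMatch` fails for any token, a highlighter working from offsets would mark the wrong span of text, whatever the position increments say.
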

>> It might be the fix is for the Highlighter to look at the position 
>> increment of tokens and only pass by one if multiple ones have an 
>> incr of 0 and match one part of the query.
>> Has this come up before and is the issue clear?
> The problem is clear, and I've identified this issue with my 
> exploration of the Highlighter also.  The Highlighter works well for 
> the most common scenarios, but certainly doesn't cover all the bases.  
> The majority of scenarios do not use multiple tokens in a single 
> position.  Also, it doesn't currently handle the new SpanQuery 
> family - although Highlighting spans would be quite cool.  After 
> learning how Highlighter works, I have a deep appreciation for the 
> great work Mark put into it - it is well done.
> As for this issue, though, I think your solution sounds reasonable, 
> although I haven't thought it through completely.  Perhaps Mark can 
> comment.  If you do modify it to work for your case, it

Oh sure, I'll post any changes but wait for Mark for now.

> would be great to have your contribution rolled back in :)
>     Erik
