lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject amusing interaction between advanced tokenizers and highlighter package
Date Sat, 19 Jun 2004 06:29:52 GMT
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful "term highlighter": 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has a custom TokenStream which tries to more 
intelligently tokenize java-language tokens.

A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)

[As an aside maybe it should also pair up the subtokens, so "SyncThread" 
and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me to 
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly avail javadoc as a kind of 
resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with with the Highligher package,  it doesn't look 
at the position increment of the tokens and consequently a nonsense 
stream of matches is output. If I use a different Analyzer w/ the 
highlighter (say, the StandardAnalyzer), then it doesn't show the 
matches that really matched, as it doesn't see the "subtokens".

It might be the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?


thx,
 Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message