lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tatu Saloranta <>
Subject Re: Schreiber's TermHighlighter/LuceneTools utility and the Demo...
Date Tue, 04 Feb 2003 15:26:19 GMT
On Tuesday 04 February 2003 03:17, Massimo Mannino wrote:
> From: Stone, Timothy
>       Subject: Schreiber's TermHighlighter/LuceneTools utility and the
>       Demo...
>       Date: Wed, 11 Dec 2002 13:00:28 -0800
> Has anyone successfully implemented Mark Schreiber's TermHighlighter
> interface in the
> demo? I'm seeking some pointers/samples.

I was thinking of starting to implement it at some point, but haven't yet done 
it. However, reading through the sample code, I thought there are some things
that would be useful to add to Lucene Term class(es).

Most importantly, example traverses query hierarchy to get to terms used, by 
doing explicit casts and checks. This seems like a good place to use the
classic visitor pattern instead. So, should either have abstract
methods (to be implemented by deriving classes) to implement visitor
pattern (if I recall, something like "visitTerms(TermVisitor)" to start 
traversing,  and visit calls a method in TermVisitor for each term), or
alternatively (additionally), methods for getting Terms and sub-queries
contained. [my apologies if I description of visitor pattern is bit off... 
been a while since used it... but the main idea should be about right]

This would have to be included in core Lucene classes, but I think that would 
be a good (and simple to add) addition, to allow easily collecting all
Terms and/or sub-Queries given full query has.

After getting all terms, tokenizing should be easy, and one can create a data 
structure listing all spans that need highlighting. Then this has to be 
applied to the original un-tokenized content. Structure can (for example) 
just contain character ranges along with type of highlighting (different 
There are some potential tricky parts if HTML is to be highlighted; highlight 
markers still need to nest properly (if terms are not fully contained in same 
tag level), and some highlighted terms may not be visible (to resolve this, 
one needs to compare original text and analyzed text, to see which parts are 
removed by analyzer, but that match term(s)).
But even without getting it 100% correct, output should be fairly close.

-+ Tatu +-

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message