lucene-java-user mailing list archives

From Daniel Noll
Subject Re: Highlighting text for queries with huge numbers of terms
Date Fri, 17 Feb 2006 05:11:48 GMT
Chris Hostetter wrote:
> if you build a map whose keys are tokens which begin token lists for
> queries, each of which is mapped to a value which is a list of lists of
> tokens, then you can make one pass over the tokens from the main text, and
> "lookup" whether or not this is the potential start of something to
> highlight, and if it is then check the rest of the tokens.
> that probably didn't make a lot of sense did it?

Actually, that (and the following examples and pseudocode) puts into 
words roughly what I've been churning over in my mind for a while now.

Ignoring the tokenisation, which can largely be done outside of the 
search algorithm, what I seem to have here is the need for a list 
utility method which would look something like...

     <T> Iterable<Location> findAll(List<T> bigList,
                                    List<List<T>> smallLists)
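
A minimal sketch of how such a method might be written, using the
first-token map described in the quoted text. The class name ListSearch
and the fields of Location are assumptions made for illustration, and the
sketch returns a List (which is an Iterable) rather than a lazy iterator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ListSearch {
    /** Hypothetical result type: smallLists.get(listIndex) occurs in bigList at offset start. */
    static final class Location {
        final int start;
        final int length;
        final int listIndex;

        Location(int start, int length, int listIndex) {
            this.start = start;
            this.length = length;
            this.listIndex = listIndex;
        }
    }

    static <T> List<Location> findAll(List<T> bigList, List<List<T>> smallLists) {
        // Index the small lists by their first element.
        Map<T, List<Integer>> byFirst = new HashMap<>();
        for (int i = 0; i < smallLists.size(); i++) {
            List<T> small = smallLists.get(i);
            if (small.isEmpty()) {
                continue; // an empty list has no first element to key on
            }
            byFirst.computeIfAbsent(small.get(0), k -> new ArrayList<>()).add(i);
        }

        // Single pass over the big list; only candidates sharing the
        // current element as their first element are compared in full.
        List<Location> hits = new ArrayList<>();
        for (int pos = 0; pos < bigList.size(); pos++) {
            List<Integer> candidates = byFirst.get(bigList.get(pos));
            if (candidates == null) {
                continue;
            }
            for (int idx : candidates) {
                List<T> small = smallLists.get(idx);
                int end = pos + small.size();
                if (end <= bigList.size() && bigList.subList(pos, end).equals(small)) {
                    hits.add(new Location(pos, small.size(), idx));
                }
            }
        }
        return hits;
    }
}
```

This version reports every match, including overlapping ones, and leaves
any prioritising to the caller.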

> priority to longer or shorter matches *with the same start token* can be
> done by carefully ordering the lists when you put them in the map ... but
> if you don't want overlapping highlighting, and you want to give priority
> to longer/shorter matches (instead of just the "first" match) then i think
> you really have to do at least two passes ... in the first pass you
> annotate each token with its highlight phrase, and allow tokens to be in
> multiple phrases; in the second pass you look for highlighted phrases
> containing tokens which are in other highlighted phrases that are
> "better" and undo that highlighting.

Yeah.  Actually, collisions like this are already a minor issue with our 
existing routine.  The interesting thing is that if I do highlight 
something twice using the Swing APIs, it doesn't do anything terrible to 
the display so I don't have to worry so much about it.

With HTML, of course, yes... I'd definitely need to filter to avoid 
overlapping tags.
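
One way such a filter might look, as a rough sketch only: take all the
matched spans, prefer longer ones, and drop any span that shares a token
with one already kept. The names OverlapFilter, Span and keepLongest are
hypothetical, and "longer is better" is just one possible priority rule.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class OverlapFilter {
    /** Hypothetical span type: tokens [start, start + length) of the main text. */
    static final class Span {
        final int start;
        final int length;

        Span(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    static List<Span> keepLongest(List<Span> spans) {
        // Consider longer spans first; ties go to the earlier start.
        List<Span> ordered = new ArrayList<>(spans);
        ordered.sort(Comparator.comparingInt((Span s) -> -s.length)
                .thenComparingInt((Span s) -> s.start));

        BitSet taken = new BitSet(); // token positions already highlighted
        List<Span> kept = new ArrayList<>();
        for (Span s : ordered) {
            int end = s.start + s.length;
            int firstTaken = taken.nextSetBit(s.start);
            if (firstTaken >= 0 && firstTaken < end) {
                continue; // shares a token with a span that was kept earlier
            }
            taken.set(s.start, end);
            kept.add(s);
        }

        // Return the survivors in document order, ready for tag emission.
        kept.sort(Comparator.comparingInt((Span s) -> s.start));
        return kept;
    }
}
```

This is a greedy choice, so it does not necessarily maximise the total
number of highlighted tokens.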

I will probably go with the simple HashMap<String,List<Token>> approach 
first and see where that leaves me performance-wise.  I think that 
multiple multi-phrase queries should be relatively uncommon, so perhaps 
the multi-level approach won't even be necessary.


Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902
