lucene-java-user mailing list archives

From Daniel Noll <dan...@nuix.com.au>
Subject Highlighting text for queries with huge numbers of terms
Date Fri, 17 Feb 2006 04:16:46 GMT
Hi all.

I've just implemented some magic query syntax which expands simple
queries into queries containing whole lists of words.

I've implemented the queries themselves using a slight variation on
QueryFilter: a MultiQueryFilter, which runs all of the queries and
marks a single bitset. This is much faster than applying a logical OR
to many QueryFilter bitsets, and uses much less memory than a single
QueryFilter wrapped around an enormous BooleanQuery.

The queries are nice and fast, but now it occurs to me that I probably
should highlight the text matched by the word list.

Unfortunately, the contrib/highlighter code in source control fails to 
meet our needs in two ways:

   1. We don't just want fragments, we want *all* of the text, with
      highlights in the appropriate places (although we do offer a means
      to display just the fragments as well), and

   2. We don't deal with HTML, just plain text on a Swing text component.
      In other words we don't have to "format" or modify the text at all,
      just tell the Swing component which bits need to be highlighted.

The existing highlighting code we wrote basically works like this...

   1. Get the text out of the Swing component.

   2. Break the text into tokens using the appropriate Analyzer.

   3. For each term:
       3.1. Break the term into tokens using the same Analyzer.
       3.2. Iterate through the list of text tokens looking for the list
            of term tokens (basically find a sublist in a list.)
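
For illustration only, the loop in step 3 might look roughly like the
sketch below. It is not the actual code: the class name
NaiveTermHighlighter is made up, tokens are plain strings that are
assumed to have already come out of the Analyzer, and spans are token
indexes rather than the character offsets a Swing component would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-term sublist search described above.
class NaiveTermHighlighter {

    /**
     * Returns {startToken, endTokenExclusive} pairs for every occurrence
     * of every term. Each term is a list of already-analyzed tokens.
     */
    static List<int[]> findSpans(List<String> textTokens,
                                 List<List<String>> terms) {
        List<int[]> spans = new ArrayList<>();
        for (List<String> term : terms) {
            if (term.isEmpty()) {
                continue; // the analyzer produced no tokens for this term
            }
            int lastStart = textTokens.size() - term.size();
            // Slide a window over the whole text once per term.
            for (int i = 0; i <= lastStart; i++) {
                if (textTokens.subList(i, i + term.size()).equals(term)) {
                    spans.add(new int[] {i, i + term.size()});
                }
            }
        }
        return spans;
    }
}
```

The whole token list is scanned once per term, so the cost grows with
(number of terms) x (number of text tokens), which is where the
slowdown for big word lists comes from.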

This has served us well so far, but for enormous numbers of terms it 
starts to get quite slow.

Is there a better approach to highlighting a large number of terms?
For instance, it might be good to skip terms which I can cheaply
determine are not in the document at all. It might also be good to do
all of the token searches in a single pass over the text, but I'm not
entirely sure how to go about that.
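
One possible shape for the single-pass idea, offered as a sketch under
assumptions rather than a known-good answer: index the term token
sequences by their first token, then walk the text tokens once and only
compare against terms whose first token matches. The class name
SinglePassTermHighlighter is hypothetical, and as before the tokens are
plain strings assumed to come from the same Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one pass over the text, terms looked up by first token.
class SinglePassTermHighlighter {

    /** Returns {startToken, endTokenExclusive} pairs in text order. */
    static List<int[]> findSpans(List<String> textTokens,
                                 List<List<String>> terms) {
        // Group the term token sequences by their first token.
        Map<String, List<List<String>>> byFirstToken = new HashMap<>();
        for (List<String> term : terms) {
            if (term.isEmpty()) {
                continue;
            }
            List<List<String>> bucket = byFirstToken.get(term.get(0));
            if (bucket == null) {
                bucket = new ArrayList<>();
                byFirstToken.put(term.get(0), bucket);
            }
            bucket.add(term);
        }

        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < textTokens.size(); i++) {
            List<List<String>> candidates = byFirstToken.get(textTokens.get(i));
            if (candidates == null) {
                continue; // no term starts with this token
            }
            for (List<String> term : candidates) {
                int end = i + term.size();
                if (end <= textTokens.size()
                        && textTokens.subList(i, end).equals(term)) {
                    spans.add(new int[] {i, end});
                }
            }
        }
        return spans;
    }
}
```

Terms whose first token never occurs in the document cost nothing
beyond the initial map insert, which covers the "skip some terms" idea
as a side effect. If many terms share a first token, the buckets get
long, and a trie keyed on successive tokens might be a better fit.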

Daniel


-- 
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.
