lucene-java-user mailing list archives

From Daniel Noll <dan...@nuix.com.au>
Subject Re: Highlighting text for queries with huge numbers of terms
Date Sun, 19 Feb 2006 23:31:02 GMT
markharw00d wrote:
> Swing supports HTML and will do the highlight for you.
> SwingText="<html>"+highlighter.getBestFragment(tokenStream,text)+"</html>";
> 
> If you don't like that approach and really do just want to know the 
> positions, plug in your own "Formatter" class which, instead of marking 
> up the text, silently records the hit position information provided to 
> it in the "TokenGroup" class and then return the original string without 
> adding any markup. TokenGroup handles the issue of identifying runs of 
> overlapping tokens for you.

Swing's HTML renderer is unfortunately too slow for our use: it took 
something like 10 seconds to load and display a 100kB document with 
highlights.  It's pretty ugly, too.  Maybe that will change in version 
6.0, though.

The plain text renderer has the distinct advantage of being relatively 
fast at that size.  The highlighting can also be done after the text is 
displayed, even in the background, which is a huge benefit.

To use the existing highlighter, I would probably write a custom 
Formatter which takes a Swing Highlighter object and applies the 
highlights through it, and then run the highlighter in a new thread 
after the text is already displayed.
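
A minimal sketch of that "display first, highlight afterwards" idea, using
only the JDK so that it runs on its own.  The class and method names
(BackgroundHighlightSketch, findSpans, highlightLater) are hypothetical, and
findSpans is a naive stand-in for the offsets that a custom Formatter would
record from the Lucene highlighter; it is not the Lucene API.  It also assumes
a JTextArea-style component whose getText() offsets match document offsets.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import javax.swing.SwingWorker;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Highlighter;
import javax.swing.text.JTextComponent;

final class BackgroundHighlightSketch {

    /**
     * Hypothetical stand-in for the analysis step: scans runs of letters and
     * digits and returns {start, end} offsets of tokens found in the term set.
     */
    static List<int[]> findSpans(String text, Set<String> terms) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String token = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (terms.contains(token)) {
                spans.add(new int[] {start, i});
            }
        }
        return spans;
    }

    /** Call on the event dispatch thread once the text is already displayed. */
    static void highlightLater(final JTextComponent component, final Set<String> terms) {
        final String text = component.getText(); // snapshot taken on the EDT
        new SwingWorker<List<int[]>, Void>() {
            @Override
            protected List<int[]> doInBackground() {
                return findSpans(text, terms); // runs off the EDT
            }

            @Override
            protected void done() {
                // done() runs on the EDT, so touching the component is safe here.
                Highlighter.HighlightPainter painter =
                        new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
                try {
                    for (int[] span : get()) {
                        component.getHighlighter().addHighlight(span[0], span[1], painter);
                    }
                } catch (InterruptedException | ExecutionException | BadLocationException e) {
                    // Sketch only: a failed or stale highlight pass is simply dropped.
                }
            }
        }.execute();
    }
}
```

With a real highlighter in place, only doInBackground() would change; the
offsets would still be applied in done().  If the text can change while the
worker is running, the recorded offsets may be stale, which this sketch does
not handle.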

But...

> Hoss, your pseudocode looked like a solution for identifying phrase 
> queries.  Lack of proper support for phrase queries is a known issue 
> with the current highlighter but I thought the primary issue in question 
> here was speed?

Actually, we do need support for phrase queries (which pretty much rules 
out the existing highlighter code) but slop isn't as important.

> Not sure this helps for non-phrase queries.

Indeed, the speed will be roughly the same for non-phrase queries, since 
a lookup in a HashMap costs pretty much the same as one in a HashSet.
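
To make the HashMap point concrete, here is one hedged way exact phrase
matching (no slop) could be keyed off a map.  This is not the pseudocode
referred to above, which is not reproduced in this message; the class
PhraseMatchSketch and its methods are hypothetical.  Phrases are indexed by
their first term, so a single-term query is just a phrase of length one and
costs one map lookup per token, much like a HashSet check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PhraseMatchSketch {

    /** Indexes each phrase by its first term. Phrases must be non-empty. */
    static Map<String, List<String[]>> indexByFirstTerm(List<String[]> phrases) {
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] phrase : phrases) {
            index.computeIfAbsent(phrase[0], k -> new ArrayList<>()).add(phrase);
        }
        return index;
    }

    /**
     * Returns {firstTokenIndex, lastTokenIndex} pairs for every exact,
     * in-order occurrence of an indexed phrase in the token stream.
     */
    static List<int[]> findMatches(String[] tokens, Map<String, List<String[]>> index) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String[]> candidates = index.get(tokens[i]); // one lookup per token
            if (candidates == null) {
                continue;
            }
            for (String[] phrase : candidates) {
                int end = i + phrase.length;
                if (end <= tokens.length
                        && Arrays.equals(phrase, Arrays.copyOfRange(tokens, i, end))) {
                    matches.add(new int[] {i, end - 1});
                }
            }
        }
        return matches;
    }
}
```

Token indices would still need to be mapped back to character offsets before
anything can be highlighted; that step is left out here.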

> Also, I don't think hitting the index to work out 
> what terms were a hit for the doc in question in order to shorten the 
> list of terms to highlight  is likely to speed up things. If anything, 
> the extra disk IO is likely to slow it down.

That's a good point, particularly in the case of small documents.

We actually limit the display to 100kB at the moment because even 
JTextArea gets pretty inefficient once you get up to 1MB of text. 
The nasty part is that it can't load the text itself in the background: 
the text component remains completely white until the entire text is 
present.  A custom Document implementation may be able to work around 
some of that, but the last time I tried it, there was some flicker each 
time more text was appended.  A completely custom component is probably 
the way to go. :-)
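
For reference, the incremental-append approach mentioned above might look
roughly like the following.  It is a sketch under assumptions, not the code
that was actually tried: ChunkedLoadSketch and its methods are hypothetical
names, the text is already in memory here where a real loader would read it
from disk inside doInBackground(), and nothing in it addresses the flicker.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.SwingWorker;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;

final class ChunkedLoadSketch {

    /** Splits text into pieces of at most chunkSize characters (chunkSize > 0). */
    static List<String> chunks(String text, int chunkSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            out.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
        }
        return out;
    }

    /** Appends text to the document a chunk at a time, off the EDT where possible. */
    static void loadInBackground(final Document doc, final String text, final int chunkSize) {
        new SwingWorker<Void, String>() {
            @Override
            protected Void doInBackground() {
                for (String chunk : chunks(text, chunkSize)) {
                    publish(chunk); // handed to process() on the EDT, possibly batched
                }
                return null;
            }

            @Override
            protected void process(List<String> batch) {
                for (String chunk : batch) {
                    try {
                        doc.insertString(doc.getLength(), chunk, null);
                    } catch (BadLocationException e) {
                        throw new IllegalStateException(e);
                    }
                }
            }
        }.execute();
    }
}
```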

Daniel


-- 
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

