lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Highlighter that works with phrase and span queries
Date Thu, 21 Jun 2007 17:27:52 GMT
While we're considering highlighter performance, there was some discussion of this around another
implementation here: http://issues.apache.org/jira/browse/LUCENE-644

Ronnie Kolehmainen's implementation was proven faster than the current contrib highlighter
but was almost certainly missing some of the features/support for edge cases.

There are certainly some optimisations in the existing implementation that should be possible.
Not building StringBuffers for document fragments with no hits seems an obvious step. Whether
this can be done while preserving the existing "helper" interfaces (Fragmenter/Scorer) remains
to be seen.
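
A rough sketch of that "no buffer for hit-less fragments" idea, in plain Java. Every
name here (LazyFragmentSketch, Tok, highlight) is made up for illustration and none of
it is the contrib Highlighter's API. It assumes fixed-size fragments, tokens sorted by
start offset, and it simply skips tokens that straddle a fragment boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these names do not exist in the contrib Highlighter. */
final class LazyFragmentSketch {

    /** A token with its character offsets in the original text and a hit flag. */
    static final class Tok {
        final int start, end;
        final boolean hit;
        Tok(int start, int end, boolean hit) { this.start = start; this.end = end; this.hit = hit; }
    }

    /**
     * Splits text into fixed-size fragments and returns marked-up text only for
     * fragments that contain at least one hit. A fragment without hits never
     * gets a StringBuilder allocated for it.
     */
    static List<String> highlight(String text, List<Tok> toks, int fragSize) {
        List<String> out = new ArrayList<String>();
        for (int fragStart = 0; fragStart < text.length(); fragStart += fragSize) {
            int fragEnd = Math.min(fragStart + fragSize, text.length());
            StringBuilder sb = null;   // allocated lazily, on the first hit
            int copied = fragStart;    // text before this offset is already in sb
            for (Tok t : toks) {
                // Assumption: tokens crossing a fragment boundary are ignored.
                if (!t.hit || t.start < fragStart || t.end > fragEnd) continue;
                if (sb == null) sb = new StringBuilder(fragEnd - fragStart + 16);
                sb.append(text, copied, t.start)
                  .append("<B>").append(text, t.start, t.end).append("</B>");
                copied = t.end;
            }
            if (sb != null) {          // only fragments with hits are emitted
                sb.append(text, copied, fragEnd);
                out.add(sb.toString());
            }
        }
        return out;
    }
}
```

With the text "aaa bbb ccc ddd", a fragment size of 8 and a single hit on "ccc", only the
second fragment is built and the result is the one string "<B>ccc</B> ddd". The sketch
rescans the token list for every fragment, which a real implementation would avoid.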

Cheers,
Mark


----- Original Message ----
From: Mark Miller <markrmiller@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, 21 June, 2007 2:11:52 AM
Subject: Re: Highlighter that works with phrase and span queries

I will work up some performance numbers over the next day or two to 
share with you. I have spent the last day or two with a profiler trying 
to find the biggest performance drains.

Unfortunately, I will probably not be able to squeeze out much more 
performance than the current Highlighter. When I started working on this 
project I considered starting from scratch to create a better, more 
accurate Highlighter. After some initial work I quickly came to the 
realization that Mark Harwood (with some additions by others) had 
already solved too many corner cases and interesting needs. The few 
alternate Highlighters in JIRA did not meet the standards set by Mark's 
highlighter. Trying to replicate all that work in a different manner 
didn't seem like a fruitful approach -- Harwood is more clever than I <g>

Taking that into account, I decided to extend the Highlighter using the 
great framework that is already in place. I implemented a new Scorer 
that acts much like the default Scorer, but when it finds a Query clause 
that is position sensitive (PhraseQuery, SpanQuery), it creates a 
MemoryIndex that is used to extract the correct Spans for the Query (credit 
to Paul Elschot and Mark Harwood for the approach). Non position 
sensitive Query clauses are handled similarly to the way they were in 
the original highlighter's Scorer. This means that non position 
sensitive queries are likely the same speed as before, while position 
sensitive queries are likely a bit slower. For my uses, the thing is 
damned fast -- of course my uses involve small documents (newspaper 
articles).
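
To make "position sensitive" concrete, here is a small plain-Java stand-in. It does 
not use MemoryIndex, Spans or any other Lucene class, and the names (PhraseSpanSketch, 
phraseSpans, highlight) are invented for this illustration. It assumes the text is 
already tokenized and that the phrase has no slop. The point is only that a term is 
marked when it sits inside a matching span, not wherever it happens to occur.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in; it does not use MemoryIndex or any Lucene class. */
final class PhraseSpanSketch {

    /** Returns [start, end) token-position spans where the phrase occurs exactly. */
    static List<int[]> phraseSpans(String[] tokens, String[] phrase) {
        List<int[]> spans = new ArrayList<int[]>();
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length && match; j++) {
                match = tokens[i + j].equals(phrase[j]);
            }
            if (match) spans.add(new int[] { i, i + phrase.length });
        }
        return spans;
    }

    /** Marks only the tokens that fall inside a matching span. */
    static String highlight(String[] tokens, String[] phrase) {
        boolean[] inSpan = new boolean[tokens.length];
        for (int[] s : phraseSpans(tokens, phrase)) {
            Arrays.fill(inSpan, s[0], s[1], true);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(inSpan[i] ? "<B>" + tokens[i] + "</B>" : tokens[i]);
        }
        return sb.toString();
    }
}
```

For the tokens of "the quick dog saw the quick fox" and the phrase "quick fox", this 
gives "the quick dog saw the <B>quick</B> <B>fox</B>". A scorer that only looks at 
terms would have marked the first "quick" as well.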

I am very interested in making this thing as fast as possible though, so 
I will build some benchmark tests and try to squeeze as much performance 
out of the Highlighter as I can. I will also see if my Scorer is any 
faster than the original.

All that said, my guess is that one of the slowest parts of Highlighting 
is re-tokenizing the text. There is always the option of turning on 
TermVectors and using org.apache.lucene.search.highlight.TokenSources to 
get the TokenStream. Based on Mark H's comments, it may be twice as fast 
as re-tokenizing. This method can also be used with my new Highlighter 
code (which is more a plug-in to the old Highlighter than a 
replacement).

Considering that both of your comments immediately went to performance, 
I will certainly be spending some time working on this.

- Mark

> Hi Mark,
>
> I know one large user (meaning: high query/highlight rates) of the current
> Highlighter and this user wasn't too happy with its performance.  I don't
> know the details, other than it was inefficient.  So now I'm wondering if
> you've benchmarked your Highlighter against that/current Highlighter to see
> not only which one is more accurate, but also which one is faster, and by
> how much?
>
> Thanks,
> Otis

> This is really great, Mark.  I'll look into integrating it with Solr,
> as better phrase highlighting is a definite sore point for some of our
> users.
>
> Any indication on performance differences?
>
> cheers,
> -mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org