lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Klaas <>
Subject Re: Highlighter that works with phrase and span queries
Date Mon, 27 Aug 2007 17:30:13 GMT

I'm still interested in integrating this into Solr--this is a feature  
that has been requested a few times.  It would be easier to do so if  
it were a contrib/...

thanks for the great work,

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:

> I am a bit unclear about your question. The patch you mention  
> extends the original Highlighter to support phrase and span  
> queries. It does not include any major performance increases over  
> the original Highlighter (in fact, it takes a bit longer to  
> Highlight a Span or Phrase query than it does to just highlight  
> Terms).
> Will it be released with the next version of Lucene? Doesn't look  
> like it, but anything is possible. A few people are using it, but  
> there has not been widespread interest that I have seen. My guess  
> is that there are just not enough people trying to highlight Span  
> queries -- which I'd blame on a lack of Span support in the default  
> Lucene Query syntax.
> Whether it is included soon or not, the code works well and I will  
> continue to support it.
> - Mark
> Michael Stoppelman wrote:
>> Is this jar going to be in the next release of lucene? Also, are  
>> these the
>> same as the changes in the following patch:
>> spanhighlighter10.patch
>> -M
>> On 6/27/07, Mark Miller <> wrote:
>>>> I have not looked at any highlighting code yet. Is there already an
>>> extension
>>>> of PhraseQuery that has getSpans() ?
>>> Currently I am using this code originally by M. Harwood:
>>>             Term[] phraseQueryTerms = ((PhraseQuery)  
>>> query).getTerms();
>>>             int i;
>>>             SpanQuery[] clauses = new SpanQuery 
>>> [phraseQueryTerms.length];
>>>             for (i = 0; i < phraseQueryTerms.length; i++) {
>>>                 clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
>>>             }
>>>             SpanNearQuery sp = new SpanNearQuery(clauses,
>>>                     ((PhraseQuery) query).getSlop(), false);
>>>             sp.setBoost(query.getBoost());
>>> I don't think it is perfect logic for PhraseQuery's edit  
>>> distance, but
>>> it approximates extremely well in most cases.
>>> I wonder if this approach to Highlighting would be worth it in  
>>> the end.
>>> Certainly, it would seem to require that you store offsets or you  
>>> would
>>> have to re-tokenize anyway.
>>> Some more interesting "stuff" on the current Highlighter methods:
>>> We can gain a lot of speed on the implementation of the current
>>> Highlighter if we grab from the source text in bigger chunks.  
>>> Ronnie's
>>> Highlighter appears to be faster than the original due to two  
>>> things: he
>>> doesn't have to re-tokenize text and he rebuilds the original  
>>> document
>>> in large pieces. Depending on how you want to look at it, he  
>>> loses most
>>> of the speed gained from just looking at the Query tokens instead  
>>> of all
>>> tokens to pulling the Term offset information (which appears  
>>> pretty slow).
>>> If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
>>> actually match the speed of Ronnies highlighter with the current
>>> highlighter if you just rebuild the highlighted documents in bigger
>>> pieces i.e. instead of going through each token and adding the  
>>> source
>>> text that it covers, build up the offset information until you get
>>> another hit and then pull from the source text into the  
>>> highlighted text
>>> in one big piece rather than a tokens worth at a time. Of course  
>>> this is
>>> not compatible with the way the Fragmenter currently works. If  
>>> you use
>>> the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
>>> wins because it takes so darn long to re-analyze.
>>> It is also interesting to note that it is very difficult to see in a
>>> gain in using TokenSources to build a TokenStream. Using the
>>> StandardAnalyzer, it takes docs that are 1800 tokens just to be  
>>> as fast
>>> as re-analyzing. Notice I didn't say fast, but "as fast". Anything
>>> smaller, or if you're using a simpler analyzer, and TokenSources is
>>> certainly not worth it. It just takes too long to pull TermVector  
>>> info.
>>> - Mark
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message