lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <>
Subject Re: Highlighter that works with phrase and span queries
Date Wed, 29 Aug 2007 11:58:16 GMT
It kind of is a contrib -- its really just a new Scorer class (with some 
axillary helper classes) for the old contrib Highlighter. Since the 
contrib Highlighter is pretty hardened at this point, I figured that was 
the best way to go. Or do you mean something different?

- Mark

Mike Klaas wrote:
> Mark,
> I'm still interested in integrating this into Solr--this is a feature 
> that has been requested a few times.  It would be easier to do so if 
> it were a contrib/...
> thanks for the great work,
> -Mike
> On 27-Aug-07, at 4:21 AM, Mark Miller wrote:
>> I am a bit unclear about your question. The patch you mention extends 
>> the original Highlighter to support phrase and span queries. It does 
>> not include any major performance increases over the original 
>> Highlighter (in fact, it takes a bit longer to Highlight a Span or 
>> Phrase query than it does to just highlight Terms).
>> Will it be released with the next version of Lucene? Doesn't look 
>> like it, but anything is possible. A few people are using it, but 
>> there has not been widespread interest that I have seen. My guess is 
>> that there are just not enough people trying to highlight Span 
>> queries -- which I'd blame on a lack of Span support in the default 
>> Lucene Query syntax.
>> Whether it is included soon or not, the code works well and I will 
>> continue to support it.
>> - Mark
>> Michael Stoppelman wrote:
>>> Is this jar going to be in the next release of lucene? Also, are 
>>> these the
>>> same as the changes in the following patch:

>>> -M
>>> On 6/27/07, Mark Miller <> wrote:
>>>>> I have not looked at any highlighting code yet. Is there already an
>>>> extension
>>>>> of PhraseQuery that has getSpans() ?
>>>> Currently I am using this code originally by M. Harwood:
>>>>             Term[] phraseQueryTerms = ((PhraseQuery) 
>>>> query).getTerms();
>>>>             int i;
>>>>             SpanQuery[] clauses = new 
>>>> SpanQuery[phraseQueryTerms.length];
>>>>             for (i = 0; i < phraseQueryTerms.length; i++) {
>>>>                 clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
>>>>             }
>>>>             SpanNearQuery sp = new SpanNearQuery(clauses,
>>>>                     ((PhraseQuery) query).getSlop(), false);
>>>>             sp.setBoost(query.getBoost());
>>>> I don't think it is perfect logic for PhraseQuery's edit distance, but
>>>> it approximates extremely well in most cases.
>>>> I wonder if this approach to Highlighting would be worth it in the 
>>>> end.
>>>> Certainly, it would seem to require that you store offsets or you 
>>>> would
>>>> have to re-tokenize anyway.
>>>> Some more interesting "stuff" on the current Highlighter methods:
>>>> We can gain a lot of speed on the implementation of the current
>>>> Highlighter if we grab from the source text in bigger chunks. Ronnie's
>>>> Highlighter appears to be faster than the original due to two 
>>>> things: he
>>>> doesn't have to re-tokenize text and he rebuilds the original document
>>>> in large pieces. Depending on how you want to look at it, he loses 
>>>> most
>>>> of the speed gained from just looking at the Query tokens instead 
>>>> of all
>>>> tokens to pulling the Term offset information (which appears pretty 
>>>> slow).
>>>> If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
>>>> actually match the speed of Ronnies highlighter with the current
>>>> highlighter if you just rebuild the highlighted documents in bigger
>>>> pieces i.e. instead of going through each token and adding the source
>>>> text that it covers, build up the offset information until you get
>>>> another hit and then pull from the source text into the highlighted 
>>>> text
>>>> in one big piece rather than a tokens worth at a time. Of course 
>>>> this is
>>>> not compatible with the way the Fragmenter currently works. If you use
>>>> the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
>>>> wins because it takes so darn long to re-analyze.
>>>> It is also interesting to note that it is very difficult to see in a
>>>> gain in using TokenSources to build a TokenStream. Using the
>>>> StandardAnalyzer, it takes docs that are 1800 tokens just to be as 
>>>> fast
>>>> as re-analyzing. Notice I didn't say fast, but "as fast". Anything
>>>> smaller, or if you're using a simpler analyzer, and TokenSources is
>>>> certainly not worth it. It just takes too long to pull TermVector 
>>>> info.
>>>> - Mark
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message