lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Highlighter that works with phrase and span queries
Date Mon, 27 Aug 2007 11:21:37 GMT
I am a bit unclear about your question. The patch you mention extends 
the original Highlighter to support phrase and span queries. It does not 
include any major performance increases over the original Highlighter 
(in fact, it takes a bit longer to highlight a Span or Phrase query than 
it does to just highlight Terms).

Will it be released with the next version of Lucene? Doesn't look like 
it, but anything is possible. A few people are using it, but there has 
not been widespread interest that I have seen. My guess is that there 
are just not enough people trying to highlight Span queries -- which I'd 
blame on a lack of Span support in the default Lucene Query syntax.

Whether it is included soon or not, the code works well and I will 
continue to support it.

- Mark

Michael Stoppelman wrote:
> Is this jar going to be in the next release of lucene? Also, are these the
> same as the changes in the following patch:
> https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
>
> -M
>
> On 6/27/07, Mark Miller <markrmiller@gmail.com> wrote:
>
>>> I have not looked at any highlighting code yet. Is there already an
>>> extension of PhraseQuery that has getSpans()?
>>
>> Currently I am using this code originally by M. Harwood:
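>>             // Rewrite the PhraseQuery as an approximately equivalent SpanNearQuery.
>>             PhraseQuery phraseQuery = (PhraseQuery) query;
>>             Term[] phraseQueryTerms = phraseQuery.getTerms();
>>             SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
>>
>>             // One SpanTermQuery per phrase term, in phrase order.
>>             for (int i = 0; i < phraseQueryTerms.length; i++) {
>>                 clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
>>             }
>>
>>             // Reuse the phrase slop; inOrder is false, so terms may match in any order.
>>             SpanNearQuery sp = new SpanNearQuery(clauses,
>>                     phraseQuery.getSlop(), false);
>>             sp.setBoost(query.getBoost());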
>>
>> I don't think it is perfect logic for PhraseQuery's edit distance, but
>> it approximates extremely well in most cases.
>>
>> I wonder if this approach to Highlighting would be worth it in the end.
>> Certainly, it would seem to require that you store offsets or you would
>> have to re-tokenize anyway.
>>
>> Some more interesting "stuff" on the current Highlighter methods:
>>
>> We can gain a lot of speed on the implementation of the current
>> Highlighter if we grab from the source text in bigger chunks. Ronnie's
>> Highlighter appears to be faster than the original due to two things: he
>> doesn't have to re-tokenize text and he rebuilds the original document
>> in large pieces. Depending on how you want to look at it, he gives back
>> most of the speed gained from looking only at the Query tokens, instead
>> of all tokens, by having to pull the Term offset information (which
>> appears pretty slow).
>>
>> If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
>> actually match the speed of Ronnie's highlighter with the current
>> highlighter if you just rebuild the highlighted documents in bigger
>> pieces, i.e. instead of going through each token and adding the source
>> text that it covers, build up the offset information until you get
>> another hit and then pull from the source text into the highlighted text
>> in one big piece rather than a token's worth at a time. Of course this is
>> not compatible with the way the Fragmenter currently works. If you use
>> the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
>> wins because it takes so darn long to re-analyze.
>>
>> It is also interesting to note that it is very difficult to see a
>> gain from using TokenSources to build a TokenStream. Using the
>> StandardAnalyzer, it takes docs that are 1800 tokens long just to be as
>> fast as re-analyzing. Notice I didn't say fast, but "as fast". With
>> anything smaller, or if you're using a simpler analyzer, TokenSources is
>> certainly not worth it. It just takes too long to pull TermVector info.
>>
>> - Mark
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   
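
The "rebuild in bigger pieces" idea from the quoted June message can be
sketched without any Lucene classes. This is only an illustration under
stated assumptions, not the code from the patch or from the Highlighter:
the Tok class, the method names and the <b> tags are all made up for the
example, tokens are assumed to arrive sorted by offset and not to overlap,
and fragmenting is ignored entirely (as noted in the quoted message, this
approach does not fit the way the Fragmenter currently works).

```java
import java.util.Arrays;
import java.util.List;

class ChunkedHighlight {
    /** Hypothetical token: character offsets into the source text plus a hit flag. */
    static final class Tok {
        final int start, end;
        final boolean hit;
        Tok(int start, int end, boolean hit) { this.start = start; this.end = end; this.hit = hit; }
    }

    /** Per-token rebuild: copies the text covered by every token, one token at a time. */
    static String perToken(String text, List<Tok> toks, String pre, String post) {
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this offset is already in out
        for (Tok t : toks) {
            out.append(text, copied, t.start); // gap before this token
            if (t.hit) out.append(pre);
            out.append(text, t.start, t.end);
            if (t.hit) out.append(post);
            copied = t.end;
        }
        out.append(text, copied, text.length()); // trailing text after the last token
        return out.toString();
    }

    /** Chunked rebuild: skips non-hit tokens and copies once per hit. */
    static String chunked(String text, List<Tok> toks, String pre, String post) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        int copied = 0;
        for (Tok t : toks) {
            if (!t.hit) continue;               // nothing is copied for non-hits
            out.append(text, copied, t.start);  // one big piece up to the hit
            out.append(pre).append(text, t.start, t.end).append(post);
            copied = t.end;
        }
        out.append(text, copied, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        List<Tok> toks = Arrays.asList(
                new Tok(0, 3, false), new Tok(4, 9, true),
                new Tok(10, 15, false), new Tok(16, 19, true));
        System.out.println(chunked(text, toks, "<b>", "</b>"));
    }
}
```

Both methods produce the same string; the chunked one simply makes one
append call per hit instead of one or more per token, which is where the
saving on long documents with few hits would come from.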
