lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Klaas <mike.kl...@gmail.com>
Subject Re: Highlighter that works with phrase and span queries
Date Wed, 29 Aug 2007 17:28:38 GMT
I just meant whether it would live in a lucene release (somewhere  
under contrib/) or just in JIRA.  Would including the functionality  
in Solr help get it into lucene?

-Mike

On 29-Aug-07, at 4:58 AM, Mark Miller wrote:

> It kind of is a contrib -- its really just a new Scorer class (with  
> some axillary helper classes) for the old contrib Highlighter.  
> Since the contrib Highlighter is pretty hardened at this point, I  
> figured that was the best way to go. Or do you mean something  
> different?
>
> - Mark
>
> Mike Klaas wrote:
>> Mark,
>>
>> I'm still interested in integrating this into Solr--this is a  
>> feature that has been requested a few times.  It would be easier  
>> to do so if it were a contrib/...
>>
>> thanks for the great work,
>> -Mike
>>
>> On 27-Aug-07, at 4:21 AM, Mark Miller wrote:
>>
>>> I am a bit unclear about your question. The patch you mention  
>>> extends the original Highlighter to support phrase and span  
>>> queries. It does not include any major performance increases over  
>>> the original Highlighter (in fact, it takes a bit longer to  
>>> Highlight a Span or Phrase query than it does to just highlight  
>>> Terms).
>>>
>>> Will it be released with the next version of Lucene? Doesn't look  
>>> like it, but anything is possible. A few people are using it, but  
>>> there has not been widespread interest that I have seen. My guess  
>>> is that there are just not enough people trying to highlight Span  
>>> queries -- which I'd blame on a lack of Span support in the  
>>> default Lucene Query syntax.
>>>
>>> Whether it is included soon or not, the code works well and I  
>>> will continue to support it.
>>>
>>> - Mark
>>>
>>> Michael Stoppelman wrote:
>>>> Is this jar going to be in the next release of lucene? Also, are  
>>>> these the
>>>> same as the changes in the following patch:
>>>> https://issues.apache.org/jira/secure/attachment/12362653/ 
>>>> spanhighlighter10.patch
>>>>
>>>> -M
>>>>
>>>> On 6/27/07, Mark Miller <markrmiller@gmail.com> wrote:
>>>>
>>>>>
>>>>>> I have not looked at any highlighting code yet. Is there  
>>>>>> already an
>>>>>>
>>>>> extension
>>>>>
>>>>>> of PhraseQuery that has getSpans() ?
>>>>>>
>>>>>>
>>>>> Currently I am using this code originally by M. Harwood:
>>>>>             Term[] phraseQueryTerms = ((PhraseQuery)  
>>>>> query).getTerms();
>>>>>             int i;
>>>>>             SpanQuery[] clauses = new SpanQuery 
>>>>> [phraseQueryTerms.length];
>>>>>
>>>>>             for (i = 0; i < phraseQueryTerms.length; i++) {
>>>>>                 clauses[i] = new SpanTermQuery(phraseQueryTerms 
>>>>> [i]);
>>>>>             }
>>>>>
>>>>>             SpanNearQuery sp = new SpanNearQuery(clauses,
>>>>>                     ((PhraseQuery) query).getSlop(), false);
>>>>>             sp.setBoost(query.getBoost());
>>>>>
>>>>> I don't think it is perfect logic for PhraseQuery's edit  
>>>>> distance, but
>>>>> it approximates extremely well in most cases.
>>>>>
>>>>> I wonder if this approach to Highlighting would be worth it in  
>>>>> the end.
>>>>> Certainly, it would seem to require that you store offsets or  
>>>>> you would
>>>>> have to re-tokenize anyway.
>>>>>
>>>>> Some more interesting "stuff" on the current Highlighter methods:
>>>>>
>>>>> We can gain a lot of speed on the implementation of the current
>>>>> Highlighter if we grab from the source text in bigger chunks.  
>>>>> Ronnie's
>>>>> Highlighter appears to be faster than the original due to two  
>>>>> things: he
>>>>> doesn't have to re-tokenize text and he rebuilds the original  
>>>>> document
>>>>> in large pieces. Depending on how you want to look at it, he  
>>>>> loses most
>>>>> of the speed gained from just looking at the Query tokens  
>>>>> instead of all
>>>>> tokens to pulling the Term offset information (which appears  
>>>>> pretty slow).
>>>>>
>>>>> If you use a SimpleAnalyzer on docs around 1800 tokens long,  
>>>>> you can
>>>>> actually match the speed of Ronnies highlighter with the current
>>>>> highlighter if you just rebuild the highlighted documents in  
>>>>> bigger
>>>>> pieces i.e. instead of going through each token and adding the  
>>>>> source
>>>>> text that it covers, build up the offset information until you get
>>>>> another hit and then pull from the source text into the  
>>>>> highlighted text
>>>>> in one big piece rather than a tokens worth at a time. Of  
>>>>> course this is
>>>>> not compatible with the way the Fragmenter currently works. If  
>>>>> you use
>>>>> the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's  
>>>>> highlighter
>>>>> wins because it takes so darn long to re-analyze.
>>>>>
>>>>> It is also interesting to note that it is very difficult to see  
>>>>> in a
>>>>> gain in using TokenSources to build a TokenStream. Using the
>>>>> StandardAnalyzer, it takes docs that are 1800 tokens just to be  
>>>>> as fast
>>>>> as re-analyzing. Notice I didn't say fast, but "as fast". Anything
>>>>> smaller, or if you're using a simpler analyzer, and  
>>>>> TokenSources is
>>>>> certainly not worth it. It just takes too long to pull  
>>>>> TermVector info.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message