lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Highlighter that works with phrase and span queries
Date Wed, 29 Aug 2007 11:59:54 GMT
The patch you refer to should include the javadoc/source code. If that 
is not sufficient, drop me a line privately and I will email you all of 
the source code / javadoc.

- Mark

Michael Stoppelman wrote:
> Ah, much clearer now. It seems that the jar file is just the class files. Is
> the source/javadoc code somewhere else?
>
> -M
>
> On 8/27/07, Mark Miller <markrmiller@gmail.com> wrote:
>   
>> I am a bit unclear about your question. The patch you mention extends
>> the original Highlighter to support phrase and span queries. It does not
>> include any major performance increases over the original Highlighter
>> (in fact, it takes a bit longer to Highlight a Span or Phrase query than
>> it does to just highlight Terms).
>>
>> Will it be released with the next version of Lucene? Doesn't look like
>> it, but anything is possible. A few people are using it, but there has
>> not been widespread interest that I have seen. My guess is that there
>> are just not enough people trying to highlight Span queries -- which I'd
>> blame on a lack of Span support in the default Lucene Query syntax.
>>
>> Whether it is included soon or not, the code works well and I will
>> continue to support it.
>>
>> - Mark
>>
>> Michael Stoppelman wrote:
>>     
>>> Is this jar going to be in the next release of Lucene? Also, are these the
>>> same as the changes in the following patch:
>>>
>>> https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
>>>
>>> -M
>>>
>>> On 6/27/07, Mark Miller <markrmiller@gmail.com> wrote:
>>>
>>>       
>>>>> I have not looked at any highlighting code yet. Is there already an
>>>>> extension of PhraseQuery that has getSpans()?
>>>>>
>>>> Currently I am using this code originally by M. Harwood:
>>>>             Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();
>>>>             int i;
>>>>             SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
>>>>             for (i = 0; i < phraseQueryTerms.length; i++) {
>>>>                 clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
>>>>             }
>>>>
>>>>             SpanNearQuery sp = new SpanNearQuery(clauses,
>>>>                     ((PhraseQuery) query).getSlop(), false);
>>>>             sp.setBoost(query.getBoost());
>>>>
>>>> I don't think it is perfect logic for PhraseQuery's edit distance, but
>>>> it approximates extremely well in most cases.
>>>>
>>>> I wonder if this approach to Highlighting would be worth it in the end.
>>>> Certainly, it would seem to require that you store offsets or you would
>>>> have to re-tokenize anyway.
>>>>
>>>> Some more interesting "stuff" on the current Highlighter methods:
>>>>
>>>> We can gain a lot of speed in the current Highlighter implementation
>>>> if we grab from the source text in bigger chunks. Ronnie's
>>>> Highlighter appears to be faster than the original for two reasons: he
>>>> doesn't have to re-tokenize text, and he rebuilds the original document
>>>> in large pieces. Depending on how you want to look at it, he loses most
>>>> of the speed gained from looking at just the Query tokens instead of all
>>>> tokens to pulling the Term offset information (which appears pretty
>>>> slow).
>>>>
>>>> If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
>>>> actually match the speed of Ronnie's highlighter with the current
>>>> highlighter if you just rebuild the highlighted documents in bigger
>>>> pieces, i.e. instead of going through each token and adding the source
>>>> text that it covers, build up the offset information until you get
>>>> another hit and then pull from the source text into the highlighted
>>>> text in one big piece rather than a token's worth at a time. Of course,
>>>> this is not compatible with the way the Fragmenter currently works. If
>>>> you use the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's
>>>> highlighter wins because it takes so darn long to re-analyze.
>>>>
>>>> It is also interesting to note that it is very difficult to see a
>>>> gain from using TokenSources to build a TokenStream. Using the
>>>> StandardAnalyzer, it takes docs that are 1800 tokens long just to be as
>>>> fast as re-analyzing. Notice I didn't say faster, but "as fast". For
>>>> anything smaller, or if you're using a simpler analyzer, TokenSources is
>>>> certainly not worth it. It just takes too long to pull TermVector info.
>>>>
>>>> - Mark
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   
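The chunked rebuild described in the quoted 6/27/07 message (copy the source
text in one piece between hits instead of a token's worth at a time) can be
sketched roughly as below. This is only an illustration in plain Java, not the
code from the patch or from Ronnie's highlighter: the ChunkedHighlighter and
Hit names are hypothetical, and the hit offsets are assumed to be already
known, sorted by start offset, and non-overlapping. It also ignores
fragmenting entirely, which is exactly where it conflicts with the current
Fragmenter.

```java
import java.util.List;

class ChunkedHighlighter {

    /** Hypothetical holder for the character offsets of one matching token. */
    static final class Hit {
        final int start; // inclusive offset into the source text
        final int end;   // exclusive offset into the source text

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Rebuilds the text with each hit wrapped in pre/post tags.
     * Assumes hits are sorted by start offset and do not overlap.
     */
    static String highlight(String source, List<Hit> hits, String pre, String post) {
        StringBuilder out = new StringBuilder(source.length() + 16 * hits.size());
        int copiedUpTo = 0; // everything before this offset is already in out

        for (Hit hit : hits) {
            // One bulk copy for all the non-matching text since the last hit.
            out.append(source, copiedUpTo, hit.start);
            // Then the hit itself, wrapped in the highlight tags.
            out.append(pre).append(source, hit.start, hit.end).append(post);
            copiedUpTo = hit.end;
        }

        // Trailing text after the last hit, again in one piece.
        out.append(source, copiedUpTo, source.length());
        return out.toString();
    }
}
```

With the source "the quick brown fox" and hits at offsets (4, 9) and (16, 19),
the loop does three bulk copies and two tag insertions no matter how many
tokens sit between the hits.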


