lucene-solr-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: highlighting the boolean query
Date Tue, 24 Feb 2015 18:19:50 GMT
There is also PostingsHighlighter -- I recommend it, if only for the 
performance improvement, which is substantial, but I'm not completely 
sure how it handles this issue.  The one drawback I *am* aware of is 
that it is insensitive to positions (so words from phrases get 
highlighted even in isolation).

-Mike
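
To make the position-insensitivity drawback concrete, here is a minimal toy sketch in plain Java. It is not PostingsHighlighter code and uses no Lucene classes; the class and method names (`PhraseHighlightSketch`, `highlightTerms`, `highlightPhrase`) and the whitespace tokenization are assumptions made purely for illustration. It contrasts a highlighter that marks every occurrence of any phrase word with one that marks words only where the whole phrase occurs in order.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: hypothetical names, whitespace tokenization, no Lucene.
final class PhraseHighlightSketch {

    // Position-insensitive: wrap every token that equals any word of the phrase.
    static String highlightTerms(String text, List<String> phrase) {
        String[] tokens = text.split(" ");
        boolean[] hit = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            hit[i] = phrase.contains(tokens[i]);
        }
        return render(tokens, hit);
    }

    // Position-aware: wrap tokens only where the whole phrase occurs consecutively.
    static String highlightPhrase(String text, List<String> phrase) {
        String[] tokens = text.split(" ");
        boolean[] hit = new boolean[tokens.length];
        int n = phrase.size();
        for (int i = 0; i + n <= tokens.length; i++) {
            if (Arrays.asList(tokens).subList(i, i + n).equals(phrase)) {
                Arrays.fill(hit, i, i + n, true);
            }
        }
        return render(tokens, hit);
    }

    // Join the tokens back together, wrapping the marked ones in <em> tags.
    private static String render(String[] tokens, boolean[] hit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(hit[i] ? "<em>" + tokens[i] + "</em>" : tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("brown", "fox");
        String text = "the brown fox saw a fox";
        // prints: the <em>brown</em> <em>fox</em> saw a <em>fox</em>
        System.out.println(highlightTerms(text, phrase));
        // prints: the <em>brown</em> <em>fox</em> saw a fox
        System.out.println(highlightPhrase(text, phrase));
    }
}
```

The trailing "fox" is the kind of isolated phrase word that a position-insensitive highlighter would still mark.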


On 02/24/2015 12:46 PM, Erik Hatcher wrote:
> BooleanQuery’s extractTerms looks like this:
>
> public void extractTerms(Set<Term> terms) {
>    for (BooleanClause clause : clauses) {
>      if (clause.isProhibited() == false) {
>        clause.getQuery().extractTerms(terms);
>      }
>    }
> }
>
> That’s generally the method the Highlighter calls to decide which terms to highlight.
> So even if a term didn’t match the document, the query that contained the term did match,
> and the Highlighter blindly highlights all of the query’s terms (minus prohibited ones).
> That at least explains the behavior you’re seeing, but it’s not ideal.  I’ve seen specialized
> highlighters that convert to spans, which are accurate to the exact matches within the
> document.  It’s been a while since I dug into the HighlightComponent, so maybe there are
> some other options available out of the box?
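
The behavior described above can be reproduced with a small self-contained model. This is a hedged sketch, not Lucene code: `SketchQuery`, `SketchTermQuery`, `SketchBooleanQuery` and `extractMatchingTerms` are hypothetical stand-ins invented for illustration, and a document is modeled as a plain set of terms. It builds the parsed query from this thread, `a (+b +c) d`, and compares the blind term extraction with a variant that only descends into clauses that actually matched the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy stand-in for a Lucene Query; hypothetical, not the real API. */
interface SketchQuery {
    boolean matches(Set<String> docTerms);
    /** Blind extraction: every term of every non-prohibited clause. */
    void extractTerms(Set<String> out);
    /** Hypothetical alternative: only terms from clauses that matched the document. */
    void extractMatchingTerms(Set<String> docTerms, Set<String> out);
}

final class SketchTermQuery implements SketchQuery {
    private final String term;
    SketchTermQuery(String term) { this.term = term; }
    @Override public boolean matches(Set<String> docTerms) { return docTerms.contains(term); }
    @Override public void extractTerms(Set<String> out) { out.add(term); }
    @Override public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
        if (matches(docTerms)) out.add(term);
    }
}

final class SketchBooleanQuery implements SketchQuery {
    enum Occur { MUST, SHOULD, MUST_NOT }

    private static final class Clause {
        final SketchQuery query;
        final Occur occur;
        Clause(SketchQuery query, Occur occur) { this.query = query; this.occur = occur; }
    }

    private final List<Clause> clauses = new ArrayList<>();

    SketchBooleanQuery add(SketchQuery query, Occur occur) {
        clauses.add(new Clause(query, occur));
        return this;
    }

    // All MUST clauses match, no MUST_NOT clause matches, and with no MUST
    // clauses at least one SHOULD clause has to match.
    @Override public boolean matches(Set<String> docTerms) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (Clause c : clauses) {
            boolean m = c.query.matches(docTerms);
            switch (c.occur) {
                case MUST: if (!m) return false; hasMust = true; break;
                case MUST_NOT: if (m) return false; break;
                case SHOULD: anyShould |= m; break;
            }
        }
        return hasMust || anyShould;
    }

    @Override public void extractTerms(Set<String> out) {
        for (Clause c : clauses) {
            if (c.occur != Occur.MUST_NOT) c.query.extractTerms(out);
        }
    }

    @Override public void extractMatchingTerms(Set<String> docTerms, Set<String> out) {
        if (!matches(docTerms)) return; // a failed sub-query contributes nothing
        for (Clause c : clauses) {
            if (c.occur != Occur.MUST_NOT) c.query.extractMatchingTerms(docTerms, out);
        }
    }
}

final class ExtractTermsSketch {
    /** Builds the parsed query from the thread: a (+b +c) d. */
    static SketchQuery exampleQuery() {
        SketchBooleanQuery inner = new SketchBooleanQuery()
            .add(new SketchTermQuery("b"), SketchBooleanQuery.Occur.MUST)
            .add(new SketchTermQuery("c"), SketchBooleanQuery.Occur.MUST);
        return new SketchBooleanQuery()
            .add(new SketchTermQuery("a"), SketchBooleanQuery.Occur.SHOULD)
            .add(inner, SketchBooleanQuery.Occur.SHOULD)
            .add(new SketchTermQuery("d"), SketchBooleanQuery.Occur.SHOULD);
    }

    public static void main(String[] args) {
        Set<String> doc = new LinkedHashSet<>(Arrays.asList("c", "d")); // doc has c and d, not b
        SketchQuery query = exampleQuery();

        Set<String> blind = new LinkedHashSet<>();
        query.extractTerms(blind);
        System.out.println(blind);    // prints [a, b, c, d], so c is highlighted in the doc

        Set<String> matched = new LinkedHashSet<>();
        query.extractMatchingTerms(doc, matched);
        System.out.println(matched);  // prints [d]
    }
}
```

In this model the document matches only through `d`, yet the blind extraction still hands `c` to the highlighter because the `(+b +c)` clause is not prohibited. A span-based highlighter would need real position data; the match-aware variant here only approximates that idea at the clause level.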
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
>
>> On Feb 24, 2015, at 3:16 AM, Dmitry Kan <solrexpert@gmail.com> wrote:
>>
>> Erick,
>>
>> Our default operator is AND.
>>
>> Both queries below parse the same:
>>
>> a OR (b c) OR d
>> a OR (b AND c) OR d
>>
>> The parsed query:
>>
>> <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
>> Contents:d</str>
>>
>> So this part is consistent with our expectation.
>>
>>
>>> I'm a bit puzzled by your statement that "c" didn't contribute to the
>>> score.
>>
>> What I meant was that the term c was not hit by the scorer: the explain
>> section does not refer to it. I'm using made-up terms here, but the
>> concept holds.
>>
>> The code suggests that we could benefit from storing term offsets and
>> positions:
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470
>>
>> Is this a correct assumption?
>>
>> On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>
>>> Highlighting is such a pain...
>>>
>>> what does the parsed query look like? If the default operator is OR,
>>> then this seems correct as both 'd' and 'c' appear in the doc. So
>>> I'm a bit puzzled by your statement that "c" didn't contribute to the
>>> score.
>>>
>>> If the parsed query is, indeed
>>> a +b +c d
>>>
>>> then it does look like something with the highlighter. Whether other
>>> highlighters are better for this case.. no clue ;(
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan <solrexpert@gmail.com> wrote:
>>>> Erick,
>>>>
>>>> Nope, we are using the standard Lucene qparser with some customizations that
>>>> do not affect the boolean query parsing logic.
>>>>
>>>> Should we try some other highlighter?
>>>>
>>>> On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>
>>>>> Are you using edismax?
>>>>>
>>>>> On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan <solrexpert@gmail.com> wrote:
>>>>>> Hello!
>>>>>>
>>>>>> In Solr 4.3.1 there seems to be some inconsistency with the highlighting
>>>>>> of the boolean query:
>>>>>>
>>>>>> a OR (b c) OR d
>>>>>>
>>>>>> This returns a proper hit, which shows that only d was included in the
>>>>>> document score calculation.
>>>>>>
>>>>>> But the highlighter returns both d and c in <em> tags.
>>>>>>
>>>>>> Is this a known issue of the standard highlighter? Can it be mitigated?
>>>>>>
>>>>>> --
>>>>>> Dmitry Kan
>>>>>> Luke Toolbox: http://github.com/DmitryKey/luke
>>>>>> Blog: http://dmitrykan.blogspot.com
>>>>>> Twitter: http://twitter.com/dmitrykan
>>>>>> SemanticAnalyzer: www.semanticanalyzer.info
>>>>
>>>>
>>
>>
>

