lucene-dev mailing list archives

From "David Pilato (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7541) FVH does not work well with phrases and multiple tags
Date Mon, 07 Nov 2016 10:56:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Pilato updated LUCENE-7541:
---------------------------------
    Description: 
I'm indexing a document with a field whose value is {{aaa bbb ccc ddd bbb eee fff}}.

I'm running a Bool Query which contains two should Phrase queries: {{aaa bbb}} and {{eee fff}}.

I'm using an FVH with two tags {{<1></1>}} and {{<2></2>}}.

It gives the correct result: {{<1>aaa bbb</1> ccc ddd bbb <2>eee fff</2>}}

With the same settings, I'm now running two should Phrase queries: {{aaa bbb}} and {{bbb eee}}.

I'm getting back a wrong result: {{<1>aaa bbb</1> ccc ddd <1>bbb eee</1> fff}}, where I'm expecting {{<1>aaa bbb</1> ccc ddd <2>bbb eee</2> fff}}.

Why is this?

Apparently the FVH assigns the phrases the sequence numbers {{0}} and {{1}} in the first case, but {{0}} and {{2}} in the second case.

The problem is that when we then call {{getPreTag}} with two tags, we get the first tag instead of the second one, because {{2 % 2}} is {{0}}:

{code:java}
  protected String getPreTag( String[] preTags, int num ){
    int n = num % preTags.length;
    return preTags[n];
  }
  
  protected String getPostTag( String[] postTags, int num ){
    int n = num % postTags.length;
    return postTags[n];
  }
{code}
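
To make the arithmetic concrete, here is a minimal standalone sketch of the same modulo selection. It is not Lucene code: the class {{TagSelectionSketch}} and the method {{pickTag}} are hypothetical names introduced only for illustration, and the sketch assumes exactly the two pre-tags used in this report.

```java
// Standalone sketch, not part of Lucene: TagSelectionSketch and pickTag are
// hypothetical names that only mirror the modulo logic of getPreTag above.
class TagSelectionSketch {

  // Same arithmetic as getPreTag/getPostTag: wrap the number around the tag array.
  static String pickTag(String[] tags, int num) {
    return tags[num % tags.length];
  }

  public static void main(String[] args) {
    String[] preTags = {"<1>", "<2>"};

    // First case from the report: numbers 0 and 1 select two different tags.
    System.out.println(pickTag(preTags, 0) + " " + pickTag(preTags, 1));

    // Second case: numbers 0 and 2 select the same tag, since 2 % 2 == 0.
    System.out.println(pickTag(preTags, 0) + " " + pickTag(preTags, 2));
  }
}
```

With only two tags, any gap in the numbering makes the second phrase wrap back onto the first tag, which matches the wrong output shown above.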

I have not yet found how to fix that, but I believe the cause is somewhere in the {{org.apache.lucene.search.vectorhighlight.FieldQuery}} class:

{code:java}
    private void markTerminal( int slop, float boost ){
      this.terminal = true;
      this.slop = slop;
      this.boost = boost;
      this.termOrPhraseNumber = fieldQuery.nextTermOrPhraseNumber();
    }
{code}

My guess is that this call to {{nextTermOrPhraseNumber()}} increments the number one extra time because the term {{bbb}} has already been seen.
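
The guess above can be pictured with a toy model of a shared counter. This is only an illustration of the hypothesis, not Lucene's actual {{FieldQuery}} logic: the class {{PhraseNumberSketch}} is hypothetical, and the extra call between the two phrases is an assumption whose real origin has not been identified.

```java
// Toy model only: PhraseNumberSketch is hypothetical and does not reproduce
// Lucene's FieldQuery. It just shows how one extra call shifts the numbering.
class PhraseNumberSketch {
  private int counter = 0;

  // Hands out 0, 1, 2, ... with one number consumed per call.
  int nextTermOrPhraseNumber() {
    return counter++;
  }

  public static void main(String[] args) {
    PhraseNumberSketch fieldQuery = new PhraseNumberSketch();

    int first = fieldQuery.nextTermOrPhraseNumber();   // phrase "aaa bbb"
    fieldQuery.nextTermOrPhraseNumber();               // assumed extra call (origin unknown)
    int second = fieldQuery.nextTermOrPhraseNumber();  // phrase "bbb eee"

    // The two phrases end up with numbers that are two apart instead of adjacent.
    System.out.println(first + " and " + second);
  }
}
```

If something like this happens, the second phrase receives a number that wraps onto the first tag whenever only two tags are configured.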

I'm going to attach a test case patch.



> FVH does not work well with phrases and multiple tags
> -----------------------------------------------------
>
>                 Key: LUCENE-7541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7541
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: trunk, 6.x
>            Reporter: David Pilato
>         Attachments: Add_test_for_FVH_with_phrase_and_multiple_tags_.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
