lucene-dev mailing list archives

From Pierre Gossé (JIRA) <j...@apache.org>
Subject [jira] [Issue Comment Edited] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.
Date Thu, 12 May 2011 14:17:55 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032421#comment-13032421 ]

Pierre Gossé edited comment on LUCENE-3087 at 5/12/11 2:17 PM:
---------------------------------------------------------------

Thanks for taking a look at this, Michael.

In fact, I should be in the TermVector.WITH_POSITIONS_OFFSETS case, since I use these parameters
in my Solr schema.xml:
<field name="..." type="..." indexed="true" stored="true" compressed="true" omitNorms="true"
termVectors="true" termPositions="true" termOffsets="true"/>

Somehow, I end up in TokenSources with the argument tokenPositionsGuaranteedContiguous set to
false, which falls back to using offsets instead of positions.

Maybe this is because of my overlapping tokens, maybe not. I'll have to take a couple of hours
sometime to figure this out. At first sight, however, it seems this parameter is always set
to false when calling TokenSources.getTokenStream with an IndexReader, because some code to
use field infos is missing.

Some work to do here, maybe, sometime. :)

      was (Author: pigo):
    Thanks for taking a look at this, Michael.

In fact, I should be in the TermVector.WITH_POSITIONS_OFFSETS case, since I use these parameters
in my Solr schema.xml:
<field name="highlight_en" type="hst2-en" indexed="true" stored="true" compressed="true"
omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>

Somehow, I end up in TokenSources with the argument tokenPositionsGuaranteedContiguous set to
false, which falls back to using offsets instead of positions.

Maybe this is because of my overlapping tokens, maybe not. I'll have to take a couple of hours
sometime to figure this out. At first sight, however, it seems this parameter is always set
to false when calling TokenSources.getTokenStream with an IndexReader, because some code to
use field infos is missing.

Some work to do here, maybe, sometime. :)
  
> highlighting exact phrase with overlapping tokens fails.
> --------------------------------------------------------
>
>                 Key: LUCENE-3087
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3087
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Pierre Gossé
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3087.patch
>
>
> Fields with overlapping tokens are not highlighted in search results when searching for exact
phrases with TermVector.WITH_OFFSETS.
> The document built in MemoryIndex for highlighting does not preserve token positions
in this case. Overlapping tokens get "flattened" (position increment always set to 1), so the
span query used to find the relevant fragment fails to identify the correct token sequence
because the positions shift.
> I corrected this by adding a position increment calculation in the subclass StoredTokenStream,
and I added a JUnit test covering this case.
> I used the Eclipse code style from trunk, but it introduces quite a few formatting differences
between the repository and working copy files. I tried to reduce them, but some line-wrapping rules
still don't match.
> Patch attached.
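
To illustrate the position shift described above, here is a minimal, self-contained Java sketch.
It is not the attached patch and does not use any Lucene classes: StoredToken,
incrementsFromPositions, flattenedIncrements and positionsFromIncrements are hypothetical names
introduced only for this example, and the token list is assumed to be sorted by position.

```java
import java.util.Arrays;
import java.util.List;

public class PositionIncrementSketch {

    /** Hypothetical stand-in for a token restored from a term vector. */
    public static final class StoredToken {
        public final String term;
        public final int position; // absolute position recorded at index time

        public StoredToken(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Derives each token's position increment from its absolute position.
     * Tokens sharing a position (overlapping tokens) get an increment of 0.
     * Assumes the list is sorted by position.
     */
    public static int[] incrementsFromPositions(List<StoredToken> tokens) {
        int[] increments = new int[tokens.size()];
        int previous = -1; // so that a first token at position 0 gets increment 1
        for (int i = 0; i < tokens.size(); i++) {
            int position = tokens.get(i).position;
            increments[i] = position - previous;
            previous = position;
        }
        return increments;
    }

    /** The "flattened" behaviour: every token gets an increment of 1. */
    public static int[] flattenedIncrements(List<StoredToken> tokens) {
        int[] increments = new int[tokens.size()];
        Arrays.fill(increments, 1);
        return increments;
    }

    /** Rebuilds absolute positions the way a consumer of the increments would. */
    public static int[] positionsFromIncrements(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }
}
```

With tokens "the"(0), "wi"(1), "wifi"(1), "fi"(2), "network"(3), the computed increments restore
the original positions, whereas the flattened increments push every token after the overlap one
position later, which is the kind of shift that can stop a phrase from matching.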

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

