lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2668) offset gap should be added regardless of existence of tokens in DocInverterPerField
Date Sun, 26 Sep 2010 13:20:33 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914972#action_12914972 ]

Yonik Seeley commented on LUCENE-2668:
--------------------------------------

bq. Or do we think apps are not relying on this quirky behavior?

I'd doubt there's a single one relying on the current behavior - in fact I think it's more
likely that there's an app out there relying on the proposed behavior and they just haven't
hit the case when no tokens were indexed for a field value.


> offset gap should be added regardless of existence of tokens in DocInverterPerField
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-2668.patch, LUCENE-2668.patch, Test.java
>
>
> Problem: If one value of a multiValued field consists only of a stop word (e.g. "will" in
the following sample) and the field is analyzed by StopAnalyzer when indexing, the offsets of
the subsequent tokens are not correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the offsets (start,end)
for "use" and "Lucene" will be use(10,13) and Lucene(14,20). But if you use StopAnalyzer, the
offsets will be use(9,12) and lucene(13,19). Since the searcher cannot know which analyzer was
used at indexing time, this mismatch causes FVH (FastVectorHighlighter) to highlight out of
alignment.
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag stays false and
the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap were always added, I think
things would be simpler.
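
For illustration, the offset arithmetic in the report can be reproduced with a small standalone sketch. It does not use Lucene. The class OffsetGapSketch, the method offsets, and the alwaysAddGap flag are hypothetical names introduced here. The sketch assumes an offset gap of 1, treats each field value as a single token, and assumes the value's end offset is added to the running offset even when every token is filtered out, which is consistent with the numbers quoted above. It keeps the original case of each token, whereas StopAnalyzer would lowercase "Lucene".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OffsetGapSketch {
    // Assumed gap between field values; chosen to match the offsets in the report.
    static final int OFFSET_GAP = 1;

    /**
     * Simulates token offsets for a multi-valued field where each value is one token.
     *
     * @param values       the field values, in the order they were added
     * @param stopWords    values the simulated analyzer removes entirely
     * @param alwaysAddGap false mimics the "if (anyToken)" condition,
     *                     true mimics the proposed unconditional gap
     */
    static List<String> offsets(String[] values, Set<String> stopWords, boolean alwaysAddGap) {
        List<String> out = new ArrayList<>();
        int base = 0; // running offset, playing the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = !stopWords.contains(value.toLowerCase());
            if (anyToken) {
                out.add(value + "(" + base + "," + (base + value.length()) + ")");
            }
            // Assumption: the end offset of the value is added whether or not a token survived.
            base += value.length();
            if (anyToken || alwaysAddGap) {
                base += OFFSET_GAP;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"Mike", "will", "use", "Lucene"};
        Set<String> stop = Set.of("will");
        System.out.println("no stop words:     " + offsets(values, Set.of(), false));
        System.out.println("conditional gap:   " + offsets(values, stop, false));
        System.out.println("unconditional gap: " + offsets(values, stop, true));
    }
}
```

With the conditional gap, "use" and "Lucene" land one position earlier than they do when no stop word is removed. With the unconditional gap, the offsets of the surviving tokens no longer depend on whether "will" was filtered.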

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

