lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2529) always apply position increment gap between values
Date Mon, 04 Oct 2010 18:43:33 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917697#action_12917697
] 

David Smiley commented on LUCENE-2529:
--------------------------------------

bq. My suggestion is that if you have values like this with position dependencies, they are
really one single value, not independent values, and don't belong in a multivalued-field.

My use-case is not like your example.  The value at index x has no relation to values before
or after it in the same field.  It _does_ have a relationship to a separate multi-valued field's
value at the same value index.  With this patch and an analyzer that sets all position increments
to 0, all tokens for the same value index across both fields will have the same token position.

bq. (me)    For my problem space, I'm willing to sacrifice the ability to do phrase queries.

bq. Right, but my concern is that other users are not.  I don't think we should discard the
first token's position increment value completely, will the QueryParser do this too?

The fact that my entire solution (for which this patch is a subset) can't do phrases is not
evident in this patch.  Perhaps I should have kept my overall use case a secret so as not
to cloud the purpose of this patch.  This patch is about generating predictable/intuitive/sensible
(in my opinion) position values, particularly at the very start of a field.  I hope you would
share my opinion if you apply the patch and examine the results such as what the test case
the patch modifies does.

In my opinion (and apparently Mike McCandless -- "I like the idea of disregarding the posIncrGap
of the first token.") I disagree.  The point of a position increment gap is only meaningful
_between_ values.  It's meaningless for the first token.  At your suggestion I looked in QueryParser.java
to look for problems that may have to do with the position increment.  I don't see any problems
that would be introduced by this patch.  For convenient reference here, the position increment
is grabbed at line 608:
{code:java}
int positionIncrement = (posIncrAtt != null) ? posIncrAtt.getPositionIncrement() : 1;
if (positionIncrement != 0) {
  positionCount += positionIncrement;
} else {
  severalTokensAtSamePosition = true;
}
{code}

The ensuing logic seems pretty sane to me (albeit complicated).  The only thing in here that
could arguably be improved is line 645:
{code:java}
if (positionCount == 1 || (!quoted && !autoGeneratePhraseQueries)) {
{code}
I think the "== 1" should be "<= 1" just in case there's some intentional oddities in the
analyzer that 99.9% of people wouldn't ever do.  I'm in the exception case.  But I have a
different query time analyzer so I don't care that much.


> always apply position increment gap between values
> --------------------------------------------------
>
>                 Key: LUCENE-2529
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2529
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>         Environment: (I don't know which version to say this affects since it's some
quasi trunk release and the new versioning scheme confuses me.)
>            Reporter: David Smiley
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2529_always_apply_position_increment_gap_between_values.patch,
LUCENE-2529_skip_posIncr_for_1st_token.patch, LUCENE-2529_skip_posIncr_for_1st_token.patch,
LUCENE-2529_test.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I'm doing some fancy stuff with span queries that is very sensitive to term positions.
 I discovered that the position increment gap on indexing is only applied between values when
there are existing terms indexed for the document.  I suspect this logic wasn't deliberate,
it's just how its always been for no particular reason.  I think it should always apply the
gap between fields.  Reference DocInverterPerField.java line 82:
> if (fieldState.length > 0)
>           fieldState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
> This is checking fieldState.length.  I think the condition should simply be:  if (i >
0).
> I don't think this change will affect anyone at all but it will certainly help me.  Presently,
I can either change this line in Lucene, or I can put in a hack so that the first value for
the document is some dummy value which is wasteful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message