lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2529) always apply position increment gap between values
Date Mon, 04 Oct 2010 17:27:33 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917663#action_12917663 ]

Robert Muir commented on LUCENE-2529:
-------------------------------------

bq. Rob, I don't completely follow your first paragraph

What I was trying to say is that there's no way for positions to be properly accumulated
across the values of a multi-valued field.
For example (I will use the pipe as a separator between values and assume English stopwords):
{noformat}
brown fox | went to | market
{noformat}

In this case the index will "lose" the 2 position increments caused by "went" and "to", and
they won't be reflected in the "market" position.
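
To make the lost increments concrete, here is a minimal sketch of the position arithmetic. It is a simplified model and not Lucene code: the class `PositionModel`, its `positions` method, and the two-word stop set are hypothetical, and it assumes that a removed stopword adds its increment to the next surviving token of the same value and that increments still pending at the end of a value are dropped.

```java
import java.util.*;

public class PositionModel {
    // Assumption taken from the example above: "went" and "to" are stopwords.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("went", "to"));

    /** Returns token -> position for the given values, in order of appearance. */
    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int position = -1; // position of the last emitted token
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                position += gap; // gap applied between values
            }
            int pending = 1; // increment the next surviving token will carry
            for (String tok : values.get(i).trim().split("\\s+")) {
                if (tok.isEmpty()) {
                    continue;
                }
                if (STOP.contains(tok)) {
                    pending++; // removed token: carry its increment forward
                    continue;
                }
                position += pending;
                out.put(tok, position);
                pending = 1;
            }
            // Increments still pending here are discarded with the value.
        }
        return out;
    }

    public static void main(String[] args) {
        // Three separate values: the increments of "went" and "to" vanish.
        System.out.println(positions(Arrays.asList("brown fox", "went to", "market"), 0));
        // prints {brown=0, fox=1, market=2}

        // One single value: "market" reflects the two removed tokens.
        System.out.println(positions(Arrays.asList("brown fox went to market"), 0));
        // prints {brown=0, fox=1, market=4}
    }
}
```

With three values, "market" lands directly after "fox"; with one value it sits two positions further along.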

My suggestion is that if you have values like this with position dependencies, they are really
one single value, not independent values, and don't belong in a multi-valued field.

In this case, you can simply index the entire content as one field and handle the separator
however you want in your tokenstream. The "market" token will then properly reflect whatever
you previously did with the tokens, whether via that separator, stopwords, or other things.

bq. For my problem space, I'm willing to sacrifice the ability to do phrase queries.

Right, but my concern is that other users are not. 
I don't think we should discard the first token's position increment value completely. Will
the QueryParser do this too?

bq. My patch here (and the patch already applied by Koji recently) for this issue isn't really
code specific to the problem I'm solving, but it is necessary for my approach

The previous patch (the one described on the issue) I definitely agreed with. 
But what you speak of here (discarding the first token's position) is different, 
and I'm not convinced it's necessary for your approach (you could use a single-valued field).

bq. All existing tests pass. On the basis of that alone, I'm hopeful that you, Michael, and
other committers are amenable to applying this patch.

Well, unfortunately (not your fault at all!) that isn't very comforting to me. 
For example, the queryparser has very minimal tests for this sort of thing, yet
as I mentioned above it's important to think about how it consumes tokenstreams, 
because if it's inconsistent with the indexer then queries start returning fewer results.


> always apply position increment gap between values
> --------------------------------------------------
>
>                 Key: LUCENE-2529
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2529
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>         Environment: (I don't know which version to say this affects since it's some
quasi trunk release and the new versioning scheme confuses me.)
>            Reporter: David Smiley
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2529_always_apply_position_increment_gap_between_values.patch,
LUCENE-2529_skip_posIncr_for_1st_token.patch, LUCENE-2529_skip_posIncr_for_1st_token.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I'm doing some fancy stuff with span queries that is very sensitive to term positions.
 I discovered that the position increment gap on indexing is only applied between values when
there are existing terms indexed for the document.  I suspect this logic wasn't deliberate,
it's just how it's always been for no particular reason.  I think it should always apply the
gap between fields.  Reference DocInverterPerField.java line 82:
> if (fieldState.length > 0)
>           fieldState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
> This is checking fieldState.length.  I think the condition should simply be:  if (i > 0).
> I don't think this change will affect anyone at all but it will certainly help me.  Presently,
I can either change this line in Lucene, or I can put in a hack so that the first value for
the document is some dummy value which is wasteful.
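
The difference between the two conditions in the description can be sketched with a small model. This is an illustration and not the `DocInverterPerField` code: the class `GapConditionModel` and the method `valueOffsets` are hypothetical, every token is assumed to advance the position by exactly 1, and the gap of 100 is an arbitrary choice.

```java
import java.util.Arrays;

public class GapConditionModel {
    /**
     * Returns the position at which each value starts.
     * useIndexCheck = false models "if (fieldState.length > 0)",
     * useIndexCheck = true models "if (i > 0)".
     */
    static int[] valueOffsets(int[] tokensPerValue, int gap, boolean useIndexCheck) {
        int[] offsets = new int[tokensPerValue.length];
        int position = 0; // stands in for fieldState.position
        int length = 0;   // stands in for fieldState.length
        for (int i = 0; i < tokensPerValue.length; i++) {
            boolean applyGap = useIndexCheck ? i > 0 : length > 0;
            if (applyGap) {
                position += gap;
            }
            offsets[i] = position;
            position += tokensPerValue[i];
            length += tokensPerValue[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] tokens = {0, 2, 1}; // the first value produces no tokens
        System.out.println(Arrays.toString(valueOffsets(tokens, 100, false)));
        // prints [0, 0, 102]
        System.out.println(Arrays.toString(valueOffsets(tokens, 100, true)));
        // prints [0, 100, 202]
    }
}
```

In this model the length-based check skips the gap after a first value that yields no tokens, while the index-based check applies it between every pair of values.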

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

