From: "David Smiley (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-2529) always apply position increment gap between values
Date: Sat, 2 Oct 2010 14:50:35 -0400 (EDT)
Message-ID: <10523514.513551286045435434.JavaMail.jira@thor>
In-Reply-To: <9832676.217731278433851943.JavaMail.jira@thor>

     [ https://issues.apache.org/jira/browse/LUCENE-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-2529:
---------------------------------

    Attachment:
LUCENE-2529_skip_posIncr_for_1st_token.patch

(patch updated)

bq. Maybe, instead of that +1 inside IW, we change the default posIncrGap to 1?

I had the +1 at the gap level (i.e. between values) because I was trying to get a blank value (or a value consisting only of stop words) to bump the position counter as well. After tinkering with this some more, I realize that I can still achieve my aims without doing that, but it is still necessary to ignore the very first position increment of the very first value, and only that one. See the new patch. I think the result should now be even more amenable to others (i.e. it is the least disruptive), since the position increment of the first token of subsequent values is still honored for anyone who sets it.

bq. Can you spell out examples of how the indexed positions will change w/ this patch - I'm having trouble visualizing this. EG for a single valued field, multi-valued, etc.

A single-valued field is unaffected. The first emitted token (if there is one at all) remains at position 0 no matter what the analyzer does. This is also true for the first value of a multi-valued field, if there is one. For multi-valued fields, it is now always the case that the first token of each subsequent value (i.e. not the first value) is placed at the previous position (0 if none) + the gap + the first position increment of this value (typically 1). This is consistent and sensible. Formerly, if the first value was blank (or consisted only of stop words), you'd get 1 less than what you get now. I hope the test I modified as part of this patch makes this clearer; I had to increment the tested positions by 1. As I said before, I also think the code is clearer since it no longer has that conditional pre-decrement and post-increment of the position, which was probably only understood by you. And I did away with the weird "+1" at the gap from my previous patch.

bq.
Man I really want to get this logic out of indexer and into the analysis chain (LUCENE-2450 enables this). How multi-valued streams should handle the transition from one value to another shouldn't be inside the indexer... and maybe (someday) tokens should store their position (not the gap) so we don't have this cryptic logic inside the indexer..

That sounds great. There are other strategies for manipulating position increments that I simply can't implement without hacking this code further. For example, it would be neat if the first token of a value could be made to start at posIncGap*valueIndex (ex: 0, 1000, 2000, ...) so that Span queries could determine which value a term matched by looking at its position (ex: 3092: divide by 1000, drop the remainder, add 1: the 4th value).

> always apply position increment gap between values
> --------------------------------------------------
>
>                 Key: LUCENE-2529
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2529
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>         Environment: (I don't know which version to say this affects since it's some quasi trunk release and the new versioning scheme confuses me.)
>            Reporter: David Smiley
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2529_always_apply_position_increment_gap_between_values.patch, LUCENE-2529_skip_posIncr_for_1st_token.patch, LUCENE-2529_skip_posIncr_for_1st_token.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I'm doing some fancy stuff with span queries that is very sensitive to term positions. I discovered that the position increment gap on indexing is only applied between values when there are existing terms indexed for the document. I suspect this logic wasn't deliberate, it's just how it's always been for no particular reason. I think it should always apply the gap between fields.
> Reference DocInverterPerField.java line 82:
>     if (fieldState.length > 0)
>       fieldState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
> This is checking fieldState.length. I think the condition should simply be: if (i > 0).
> I don't think this change will affect anyone at all but it will certainly help me. Presently, I can either change this line in Lucene, or I can put in a hack so that the first value for the document is some dummy value which is wasteful.
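
To make the difference between the two conditions in the issue description concrete, here is a minimal sketch of a per-field position counter. It is a toy model, not Lucene code: the class GapConditionSketch, the method positions(), and the useValueIndex flag are all made up for illustration, every token is assumed to advance the position by exactly one, and the first-token handling from the later patch is deliberately not modeled. It only contrasts "length > 0" with "i > 0".

```java
import java.util.ArrayList;
import java.util.List;

public class GapConditionSketch {

    /**
     * Toy model of a per-field position counter (hypothetical, not Lucene code).
     * Each inner array is one value of a multi-valued field, already split into
     * tokens. Assumption: every token advances the position by exactly one.
     *
     * @param useValueIndex true  -> apply the gap when i > 0 (proposed condition)
     *                      false -> apply the gap when length > 0 (original condition)
     */
    static List<Integer> positions(String[][] values, int gap, boolean useValueIndex) {
        List<Integer> out = new ArrayList<>();
        int position = 0; // next free position in the field
        int length = 0;   // number of tokens indexed so far for the field
        for (int i = 0; i < values.length; i++) {
            boolean applyGap = useValueIndex ? i > 0 : length > 0;
            if (applyGap) {
                position += gap;
            }
            for (String token : values[i]) {
                out.add(position); // record where this token lands
                position++;
                length++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First value is blank (or all stop words), so it emits no tokens.
        String[][] blankFirst = { {}, {"a", "b"} };
        System.out.println(positions(blankFirst, 100, false)); // [0, 1]
        System.out.println(positions(blankFirst, 100, true));  // [100, 101]

        // With a non-blank first value, both conditions agree.
        String[][] nonBlank = { {"x", "y"}, {"a", "b"} };
        System.out.println(positions(nonBlank, 100, false));   // [0, 1, 102, 103]
        System.out.println(positions(nonBlank, 100, true));    // [0, 1, 102, 103]
    }
}
```

In this model the two conditions only diverge when an earlier value produced no tokens, which is the case the issue description is about.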