lucene-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Strange behavior of positionIncrementGap
Date Fri, 11 Aug 2006 19:08:35 GMT
: For example, if a field F has values A, B and C the following example
: cases arise:
:   1.  A and B both generate no tokens ==> no positionIncrementGaps are
: generated
:   2.  A has no tokens but B does ==> just the gap between B and C
:   3.  A has tokens but B and C do not ==> both gaps between A and B, and
: between B and C are generated
:
: So, empty fields are treated anomalously.  They are ignored for gap
: purposes at the beginning of the field list, but included if they occur
: later in the field list.

Since positions are always relative, I'm not sure I understand how this
caused a problem for you ... but I suspect it's because there's more to
what you describe ... in each of the 3 cases you outlined, what happens if
there is a field value "D" which always produces tokens?  Based on your
description so far, I'm guessing the following scenario (using lower case
to indicate no tokens produced and upper case to indicate tokens were
produced) ...

1) a b C _gap_ D             ...results in:  C _gap_ D
2) a B _gap_ C _gap_ D       ...results in:  B _gap_ C _gap_ D
3) A _gap_ b _gap_ c _gap_ D ...results in:  A _double_gap_ D

...is that the behavior you are seeing?

Only case #3 seems "wrongish" to me there. ... I started to explain why I
thought it made sense to go ahead and "fix this", where by fix I meant only
insert one gap in case #3 ... and then realized I was actually arguing in
favor of the current behavior for case #3, here is why...

   based on the semi-frequently discussed usage of token gap sizes to
   denote sentence/paragraph/page boundaries for the purpose of sloppy
   phrase queries, it certainly seems worthwhile to fix to me (so that
   queries like "find Erik within 3 pages of Otis" still work even if one
   of those pages is blank) ...

...that's when I realized the current behavior of case #3 is actually
important for accurate matching, otherwise a search for two words within a
certain number of pages would have a false match if those pages were
blank.  Case #1 seems fine, but case #2 seems like the "wrong" case to me
now, because trying to find occurrences of "B" on page #1 using a
SpanFirst query will have false positives ... it seems like the
positionIncrementGap should always be called/used after any field value is
added (even if the value results in no tokens) before the next value is
added (even if that value results in no tokens).
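
To make the case #2 concern concrete, here is a rough sketch in Python. It
is a toy model, not Lucene code: the function name token_positions, the
always_gap flag, and the gap of 100 are all made up for illustration, and
the "current" rule is only my reading of the behavior described in this
thread (a gap is added before a value only once some earlier value has
produced a token).

```python
def token_positions(values, gap, always_gap):
    """Return (token, position) pairs for a multi-valued field.

    values:      list of token lists, one per field value; an empty list
                 models a value that produces no tokens (e.g. a blank page).
    gap:         hypothetical position increment gap between values.
    always_gap:  False models the behavior described in the thread (assumed:
                 no gap until some earlier value has produced a token);
                 True models the proposal (a gap between every pair of values).
    """
    out = []
    position = 0  # position the next token will receive
    for index, tokens in enumerate(values):
        # Decide whether to add a gap before this value.
        if index > 0 and (always_gap or out):
            position += gap
        for token in tokens:
            out.append((token, position))
            position += 1
    return out


# Case #2 with a trailing D: page 1 is blank, B is on page 2.
pages = [[], ["B"], ["C"], ["D"]]

current = token_positions(pages, gap=100, always_gap=False)
proposed = token_positions(pages, gap=100, always_gap=True)

print(current)   # [('B', 0), ('C', 101), ('D', 202)]
print(proposed)  # [('B', 100), ('C', 201), ('D', 302)]
```

Under the assumed current rule, B lands at position 0, so any "is this term
within the first gap's worth of positions" check would treat it as being on
page 1. With a gap after every value, B starts at 100 and the blank page is
accounted for.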


Does this jibe with what you were expecting, and the patch you were
considering?




-Hoss

