lucene-dev mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Position increment (tokens, DocumentWriter), max field length
Date Fri, 05 Dec 2003 05:26:35 GMT
...
> > So either this patch should be pulled, or we need to add
> > position-increment-like support to PhraseQuery.  I plan to do the
> > latter in the next few months (for a contract I'm working on) so
> > perhaps we should just pull this patch until PhraseQuery is updated,
> > at which time we can consider updating QueryParser to take advantage
> > of this feature.
>
> Sounds good to me.  I can't wait to see the new and improved
> PhraseQuery!

I have a question about how position increment is handled in
DocumentWriter's invertDocument (the main tokenization/indexing method). It does
the following:

            for (Token t = stream.next(); t != null; t = stream.next()) {
              position += (t.getPositionIncrement() - 1);
              addPosition(fieldName, t.termText(), position++);
              if (position > maxFieldLength) break;
            }

If I'm not mistaken, this means that the maxFieldLength comparison counts
"holes" in the token sequence as well as the tokens themselves. That behaviour
might be problematic, especially if such holes are used to mark
sentence/paragraph boundaries (to reduce the score of, or avoid hits for,
phrase queries), as was discussed recently. Also, since that count is saved in
the index, such holes "bloat" the perceived document size, and thus reduce the
document's relative weight.

It'd be easy to fix this to count only tokens (I can provide a patch if that's
wanted), but I wanted to make sure I'm not misunderstanding something
fundamental here.
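
To make the difference concrete, here is a minimal standalone sketch that does
not use Lucene at all. Tok, countIndexedByPosition and countIndexedByTokens are
hypothetical names made up for illustration; the first method mirrors the loop
quoted above (limit compared against the position), the second is one possible
shape of the fix (limit compared against a separate token counter). It is not
the actual patch.

```java
import java.util.Arrays;
import java.util.List;

final class MaxFieldLengthSketch {

    /** Hypothetical stand-in for a token: text plus a position increment. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Mirrors the quoted loop: gaps advance the position, so they count toward the limit. */
    static int countIndexedByPosition(List<Tok> tokens, int maxFieldLength) {
        int position = 0;
        int indexed = 0;
        for (Tok t : tokens) {
            position += (t.positionIncrement - 1); // skip over the hole, if any
            position++;                            // stands in for addPosition(..., position++)
            indexed++;
            if (position > maxFieldLength) break;
        }
        return indexed;
    }

    /** Assumed alternative: the limit is checked against the number of tokens seen. */
    static int countIndexedByTokens(List<Tok> tokens, int maxFieldLength) {
        int position = 0; // still tracked, since it is what would be stored per term
        int length = 0;   // tokens only, holes excluded
        for (Tok t : tokens) {
            position += (t.positionIncrement - 1);
            position++;
            length++;
            if (length > maxFieldLength) break;
        }
        return length;
    }

    public static void main(String[] args) {
        // Three tokens, then a gap of 10 positions (say, a sentence boundary), then four more.
        List<Tok> tokens = Arrays.asList(
            new Tok("a", 1), new Tok("b", 1), new Tok("c", 1),
            new Tok("d", 10),
            new Tok("e", 1), new Tok("f", 1), new Tok("g", 1), new Tok("h", 1));

        System.out.println("by position: " + countIndexedByPosition(tokens, 5));
        System.out.println("by tokens:   " + countIndexedByTokens(tokens, 5));
    }
}
```

With a limit of 5, the position-based loop stops right after "d" because the
single hole pushes the position past the limit, while the token-based loop
keeps going until the token count itself exceeds the limit. Both variants keep
the quoted loop's behaviour of checking the limit after adding the token.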

-+ Tatu +-



