lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: Indexing the same field multiple times in a doc.
Date Tue, 15 Aug 2006 22:16:17 GMT
: Here's the root question: "Am I reasonably safe, for a single document, in
: thinking of indexing multiple chunks with the same field as being identical,
: for all practical purposes, with indexing the field once with all the chunks
: concatenated together?".

Essentially that's true -- the difference is in the way the
getPositionIncrementGap method of your analyzer is used.  For a single large
field value, it is never used.  For two or more values, it's called in
between each value.
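
A minimal sketch of that behaviour, written as a simplified model and not as
Lucene's actual indexing code. It assumes every token advances the position
by one and that the gap is added only between values of the same field. The
class name, the method name, and the whitespace tokenization are all
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how token positions are assigned when one field
// name is added to a document several times. Not Lucene code.
class PositionGapSketch {

    /** Returns the position of every whitespace-separated token, in order. */
    static List<Integer> positions(String[] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int position = -1; // the first token lands on position 0
        for (String value : values) {
            // The gap is only applied once earlier tokens exist, so a
            // single value never sees it.
            if (!out.isEmpty()) {
                position += gap;
            }
            for (String token : value.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // empty value, nothing to index
                }
                position += 1;
                out.add(position);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One concatenated value: the gap is never consulted.
        System.out.println(positions(
            new String[] {"one two three four five six"}, 100));
        // Two values: the gap is inserted between them.
        System.out.println(positions(
            new String[] {"one two three", "four five six"}, 100));
    }
}
```

With a gap of 0 the two calls produce the same positions, which is the sense
in which the two ways of indexing are equivalent.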

: What surprised me a bit is that SpanQueries work just fine this way. If I
: create a span query for "two" and "five", this doc is found for some slop
: factors and not found for other slop factors, just as though I indexed the
: "tokens" field once with "one two three four five six".

right ... that's where the position increment gap can come in handy .. you
can introduce a "large" gap size, so that phrase/span queries with a slop
smaller than the gap can't match across values, while queries with a big
enough slop still can.
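
To make the slop arithmetic concrete, here is a rough sketch, assuming an
ordered two-term near query matches when the number of positions left
uncovered between the two terms is no larger than the slop. This is a
simplified model, not Lucene's span code, and the class and method names are
hypothetical. The positions used are those of "two" and "five" from the
example document, with an assumed gap of 100 between the two values.

```java
// Hypothetical model of slop matching for two single-token terms.
class SlopSketch {

    /** Number of positions strictly between the two terms. */
    static int requiredSlop(int firstPos, int secondPos) {
        return secondPos - firstPos - 1;
    }

    /** True if the second term follows the first within the given slop. */
    static boolean matches(int firstPos, int secondPos, int slop) {
        return secondPos > firstPos
            && requiredSlop(firstPos, secondPos) <= slop;
    }

    public static void main(String[] args) {
        // Single value "one two three four five six":
        // "two" is at position 1, "five" at position 4.
        System.out.println(matches(1, 4, 1));     // false
        System.out.println(matches(1, 4, 2));     // true

        // Values "one two three" and "four five six" with a gap of 100:
        // "two" stays at 1, "five" moves to 104.
        System.out.println(matches(1, 104, 2));   // false
        System.out.println(matches(1, 104, 102)); // true
    }
}
```

So a query with an everyday slop stays inside one value, and only a query
whose slop is deliberately bigger than the gap reaches across values.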

: So, are there any "gotchas" that spring to mind with the notion of chunking
: the input to < 10,000 words and indexing the chunks multiple times in the
: same field? Let me be clear I'm just beginning to design this, so all I'm

well, there's really nothing wrong with using multiple values, but as far as
doing that to deal with the 10,000 terms limit...

 a) if you don't want the limit, change the limit -- there's no reason to
work around it.
 b) i'm not entirely sure if that limit is on a single field value, or on
the total number of indexed tokens for that field name -- in which case
this approach doesn't work around it at all -- make sure you test with
two values for the same field whose total number of tokens is bigger than
10,000

