lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: strange idf in Lucene 2.1
Date Thu, 12 Apr 2007 23:29:16 GMT

: But if now the index goes through a massive update, where almost all the
: docs containing TC are deleted, and TC is not in any newly added doc,
: practically TC becomes rare too, and hence D2 should probably be scored
: higher than D1. But IDF(TC) might not (yet) reflect the massive docs
: deletion, and the scores are wrongly biased so D1 is still scored higher
: than D2.

yeah ... i was only thinking about the numDocs change (which would be the
same for idf(TC) and idf(TR)) and forgot that docFreq is ignorant of
deletes as well.
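
A rough sketch of that arithmetic, assuming the classic DefaultSimilarity-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The class name and all the counts are made up for illustration, and no Lucene classes are used:

```java
public class StaleIdfSketch {
    // Classic DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts, chosen only to show the effect.
        int maxDoc = 1000;    // doc count that still includes deleted docs
        int dfTC = 900;       // docFreq of the "common" term TC, deletes included
        int dfTR = 5;         // docFreq of the "rare" term TR
        int deletedTC = 895;  // deleted docs, all of which contained TC

        // What scoring sees while the deletes have not been merged away.
        double staleTC = idf(dfTC, maxDoc);
        double staleTR = idf(dfTR, maxDoc);

        // What it would see if only live docs were counted.
        double liveTC = idf(dfTC - deletedTC, maxDoc - deletedTC);
        double liveTR = idf(dfTR, maxDoc - deletedTC);

        System.out.println("stale: idf(TC)=" + staleTC + " idf(TR)=" + staleTR);
        System.out.println("live:  idf(TC)=" + liveTC + " idf(TR)=" + liveTR);
    }
}
```

With these numbers the stale idf(TC) stays low even though TC is now as rare among live docs as TR, which is the bias described in the quoted text.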

: I didn't follow the code for that, just thinking IDFs and scoring aloud, so
: I hope I am not missing something, but in any case this is just for the
: sake of discussion, because in reality you don't expect index statistics to
: change that dramatically, ahead of merges.

that's really the key issue to remember ... you might notice this when
deleting/re-adding 90% of the docs in an index consisting of only 10 docs,
because you'll likely still only have one segment -- but if you do the
same thing in an index of 100,000 docs you're going to get some segment
merges which will help keep things balanced.
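
Why merges help can be shown with a toy model. This is only a sketch under the assumption that a merge rewrites live docs only, so counts inflated by deletes are corrected afterwards. `Segment` and `merge` here are hypothetical names, not Lucene's classes, and the numbers are invented:

```java
import java.util.Arrays;
import java.util.List;

public class MergeSketch {
    // Toy per-segment statistics for a single term; not a Lucene class.
    static final class Segment {
        final int maxDoc;          // all docs in the segment, deleted ones included
        final int deleted;         // docs marked as deleted
        final int docFreq;         // docs containing the term, deleted ones included
        final int deletedWithTerm; // deleted docs that contain the term

        Segment(int maxDoc, int deleted, int docFreq, int deletedWithTerm) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
            this.docFreq = docFreq;
            this.deletedWithTerm = deletedWithTerm;
        }
    }

    // Assumed merge behaviour: only live docs are copied into the new segment.
    static Segment merge(List<Segment> segments) {
        int liveDocs = 0;
        int liveDocFreq = 0;
        for (Segment s : segments) {
            liveDocs += s.maxDoc - s.deleted;
            liveDocFreq += s.docFreq - s.deletedWithTerm;
        }
        return new Segment(liveDocs, 0, liveDocFreq, 0);
    }

    public static void main(String[] args) {
        Segment old = new Segment(1000, 895, 900, 895); // mostly deleted
        Segment fresh = new Segment(895, 0, 0, 0);      // re-added docs without the term
        Segment merged = merge(Arrays.asList(old, fresh));

        System.out.println("before merge: maxDoc=" + (old.maxDoc + fresh.maxDoc)
                + " docFreq=" + (old.docFreq + fresh.docFreq));
        System.out.println("after merge:  maxDoc=" + merged.maxDoc
                + " docFreq=" + merged.docFreq);
    }
}
```

In a tiny single-segment index nothing triggers that rewrite, so the stale counts linger. A larger index merges often enough that they rarely drift far.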

