lucene-java-user mailing list archives

From Stephen Thomas <stephen.warner.tho...@gmail.com>
Subject Do duplicate documents affect term scoring?
Date Sun, 27 Nov 2011 22:42:35 GMT
List,

I am indexing a subset of Wikipedia. I have 4 years' worth of data, and
have taken a snapshot of each document every month over the 4-year
span. Thus, I have 4*12=36 versions of each document. (I keep track of
the timestamp in a field.) I have noticed that in many cases a
Wikipedia document does not change much between versions, and
sometimes not at all. I end up with duplicate documents whose only
difference is the timestamp. Does this affect the term weighting used
by Lucene?

My intuition is that if a term only occurs in one document, but that
document occurs 36 times, then the frequency of the term is
"artificially" increased. Is this true? And if so, is this something I
need to worry about?
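
To make the concern concrete, here is a minimal sketch. It assumes the classic TF-IDF similarity (DefaultSimilarity in Lucene 3.x), whose idf is 1 + ln(numDocs / (docFreq + 1)). The corpus size of 1000 distinct documents and the class name are hypothetical, chosen only for illustration. Under that assumption, duplication changes a term's document frequency (and so its idf), not its frequency within any single document.

```java
// Sketch only: reproduces the classic Lucene idf formula without a Lucene dependency.
class DuplicateIdfSketch {

    // Classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long distinctDocs = 1000; // hypothetical number of distinct articles
        long copies = 36;         // versions per article, as in the question

        // No duplicates: the term appears in 1 of 1000 documents.
        double idfUnique = idf(1, distinctDocs);

        // Only the one article containing the term is duplicated:
        // docFreq grows to 36, numDocs grows to 1000 + 35.
        double idfOneDuplicated = idf(copies, distinctDocs + copies - 1);

        // Every article is duplicated 36 times:
        // docFreq and numDocs grow by the same factor.
        double idfAllDuplicated = idf(copies, distinctDocs * copies);

        System.out.println("idf, no duplicates:          " + idfUnique);
        System.out.println("idf, one article duplicated: " + idfOneDuplicated);
        System.out.println("idf, all articles duplicated:" + idfAllDuplicated);
    }
}
```

In this sketch the idf drops noticeably when only the article containing the term is duplicated, but stays close to the duplicate-free value when every article is duplicated equally, because the ratio numDocs / docFreq is roughly preserved.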

Thanks,
Steve


