lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Do duplicate documents affect term scoring?
Date Mon, 28 Nov 2011 09:10:20 GMT
Lucene won't be aware that you've got duplicate documents, but scoring
does take account of the number of documents in which search terms
appear.  See http://lucene.apache.org/java/3_5_0/scoring.html and the
javadocs for org.apache.lucene.search.Similarity.
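
To make the document-count effect concrete, here is a minimal sketch in
plain Java (no Lucene dependency). It assumes the idf formula used by
DefaultSimilarity in Lucene 3.x, 1 + ln(numDocs / (docFreq + 1)). The class
name IdfSketch and the corpus sizes are made up for illustration. Term
frequency is counted inside a single document, so duplicates do not change
it; what they change is docFreq and numDocs.

```java
public class IdfSketch {

    // Assumed formula: the one DefaultSimilarity.idf(docFreq, numDocs)
    // uses in Lucene 3.x, reproduced here so the sketch runs on its own.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int distinctArticles = 10_000; // hypothetical corpus size
        int snapshots = 36;            // snapshot count quoted in the thread
        int indexedDocs = distinctArticles * snapshots;

        // Term occurs in one article, index holds one copy per article.
        double deduplicated = idf(1, distinctArticles);

        // Same term, but the article is unchanged across all snapshots,
        // so the term is counted in 36 indexed documents.
        double unchangedArticle = idf(snapshots, indexedDocs);

        // Same index, but the term appears in only one snapshot of its
        // article (for example, it was edited out a month later).
        double singleSnapshot = idf(1, indexedDocs);

        System.out.println("deduplicated index:         " + deduplicated);
        System.out.println("term in all 36 snapshots:   " + unchangedArticle);
        System.out.println("term in a single snapshot:  " + singleSnapshot);
    }
}
```

Under these assumptions, duplicating every article the same number of
times scales docFreq and numDocs together, so idf moves only a little.
The skew shows up between terms: one that survives in every snapshot of
an article gets a noticeably lower idf than one that appears in a single
snapshot.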

Only you can say whether or not you need to worry about it.  If you
do, you could provide your own implementation of Similarity.  Or
change your indexing process to skip updates where only the timestamp
changes.
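
One possible way to do the second option is to hash each snapshot's body
(leaving the timestamp out) and index it only when the hash differs from
the previous snapshot of the same article. This is a rough sketch, not
tested against a real index: SnapshotFilter and shouldIndex are
hypothetical names, and it assumes snapshots of an article are fed in
chronological order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class SnapshotFilter {

    // Digest of the most recently seen body, keyed by article id.
    private final Map<String, String> lastDigestByArticle = new HashMap<>();

    // Returns true when this snapshot's body differs from the previous
    // snapshot of the same article (or when the article is new).
    public boolean shouldIndex(String articleId, String body) {
        String digest = sha256Hex(body);
        String previous = lastDigestByArticle.put(articleId, digest);
        return !digest.equals(previous);
    }

    // Hex-encoded SHA-256 of the text; the timestamp field is not part
    // of the input, so a timestamp-only change yields the same digest.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The indexing loop would call shouldIndex(id, body) before adding a
document and skip the add when it returns false. If you still need to
answer "what did the article look like in month N", you would have to
record the date range each indexed version covers; that part is left out
here.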


--
Ian.


On Sun, Nov 27, 2011 at 10:42 PM, Stephen Thomas
<stephen.warner.thomas@gmail.com> wrote:
> List,
>
> I am indexing a subset of Wikipedia. I have 4 years' worth of data, and
> have taken snapshots of each document at each month in the 4 year
> span. Thus, I have 4*12=36 versions of each document. (I keep track of
> the timestamp in a field.) I have noticed that in many cases, a
> Wikipedia document does not change very much between each version,
> sometimes not at all. I end up with duplicate documents; the only
> difference is the timestamp. Does this impact the term weighting used
> by Lucene?
>
> My intuition is that if a term only occurs in one document, but that
> document occurs 36 times, then the frequency of the term is
> "artificially" increased. Is this true? And if so, is this something I
> need to worry about?
>
> Thanks,
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
