lucene-solr-dev mailing list archives

From Thomas Heigl
Subject Implementing near duplicate detection algorithm using IDF statistics
Date Wed, 24 Mar 2010 09:28:15 GMT

For my current project I need to implement an index-time mechanism to
detect (near) duplicate documents. The TextProfileSignature available
out of the box seems alright, but it does not use global collection
statistics when deciding which terms are used to calculate the
signature.
Most state-of-the-art hash-based duplicate detection algorithms make
use of this information to improve precision and recall.

Is it possible to access collection statistics - especially IDF values
for all non-discarded terms in the current document - from within an
implementation of the Signature class?
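
As a rough illustration of the idea (not Solr code, and not tied to the
Signature API), here is a minimal sketch in plain Java. It assumes the
document frequencies and the collection size are already available as a
Map and an int; how to obtain them inside a Signature implementation is
exactly the open question above. The class name IdfSignatureSketch, the
tokenizer regex, and the topK parameter are hypothetical. The IDF formula
is the classic 1 + ln(numDocs / (docFreq + 1)).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: hashes the topK terms of a document ranked by tf * idf.
class IdfSignatureSketch {

    // Classic IDF: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // docFreqs and numDocs are assumed inputs (the global collection statistics).
    static String signature(String text, Map<String, Integer> docFreqs,
                            int numDocs, int topK) {
        // Count term frequencies with a deliberately naive tokenizer.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }

        // Rank terms by descending tf * idf; break ties alphabetically
        // so that the selection is deterministic.
        List<String> terms = new ArrayList<>(tf.keySet());
        terms.sort((a, b) -> {
            double wa = tf.get(a) * idf(docFreqs.getOrDefault(a, 0), numDocs);
            double wb = tf.get(b) * idf(docFreqs.getOrDefault(b, 0), numDocs);
            int byWeight = Double.compare(wb, wa);
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });

        // Keep the topK terms and sort them so word order does not matter.
        List<String> selected =
            new ArrayList<>(terms.subList(0, Math.min(topK, terms.size())));
        Collections.sort(selected);

        // Hash the selected terms with MD5 and return the digest as hex.
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest =
                md5.digest(String.join(" ", selected).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With such a scheme, two documents that differ only in frequent terms
(low IDF) would map to the same signature, because only the rare terms
are hashed. Terms missing from the map are treated as docFreq 0, which
gives them the highest IDF; whether that is desirable is a design
choice.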

Kind regards,


DDI Thomas Heigl
Software Engineer
System One
Gesellschaft für technologiegestützte
Kommunikationsprozesse m.b.H.
Stiftgasse 6/2/6