lucene-solr-dev mailing list archives

From Thomas Heigl <thomas.he...@systemone.at>
Subject Implementing near duplicate detection algorithm using IDF statistics
Date Wed, 24 Mar 2010 09:28:15 GMT
Hello,

For my current project I need to implement an index-time mechanism to
detect (near) duplicate documents. The TextProfileSignature available
out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright
but does not use global collection statistics in deciding which terms
will be used for calculating the signature.
Most state-of-the-art hash-based duplicate detection algorithms use
these statistics to improve precision and recall (see, e.g.,
http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122).

Is it possible to access collection statistics - especially IDF values
for all non-discarded terms in the current document - from within an
implementation of the Signature class?
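
To make the question concrete, here is a minimal sketch of the kind of
IDF-based term selection meant above. It is plain Java and deliberately
does not touch Solr's Signature API: the class name IdfSignatureSketch,
the docFreqs map, and the minIdf/maxIdf thresholds are all hypothetical,
and the smoothed IDF formula is just one common variant. Where the
document frequencies would actually come from inside Solr is exactly
the open question.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: hash only the terms whose IDF lies in a given band. */
public class IdfSignatureSketch {

    /** One common smoothed IDF variant (an assumption, not prescribed by the source). */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /**
     * Keeps the document terms whose IDF is within [minIdf, maxIdf],
     * sorts and de-duplicates them, and returns a SHA-1 hex digest.
     * docFreqs and numDocs stand in for the global collection statistics.
     */
    static String signature(Collection<String> docTerms,
                            Map<String, Integer> docFreqs,
                            int numDocs,
                            double minIdf,
                            double maxIdf) {
        // TreeSet gives a canonical order, so term order and repeats do not matter.
        TreeSet<String> kept = new TreeSet<>();
        for (String term : docTerms) {
            Integer df = docFreqs.get(term);
            if (df == null) {
                continue; // no statistics for this term: skip it
            }
            double weight = idf(df, numDocs);
            if (weight >= minIdf && weight <= maxIdf) {
                kept.add(term);
            }
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String term : kept) {
                md.update(term.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so "ab","c" differs from "a","bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

With such a filter, two documents that differ only in very common or
very rare terms would map to the same signature, which is the effect
the collection statistics are supposed to provide.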

Kind regards,

Thomas

--
DDI Thomas Heigl
Software Engineer
--------------------------------------------
System One
Gesellschaft für technologiegestützte
Kommunikationsprozesse m.b.H.
Stiftgasse 6/2/6
thomas.heigl@systemone.at
http://www.systemone.at
