lucene-solr-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Implementing near duplicate detection algorithm using IDF statistics
Date Wed, 24 Mar 2010 19:01:00 GMT
For reference, you can get a rental copy of this article for less than the
cost of the full PDF download here:


http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd

(joining the ACM is also a good thing to do)

(and yes, this is licensed by the ACM)

On Wed, Mar 24, 2010 at 2:28 AM, Thomas Heigl <thomas.heigl@systemone.at> wrote:

> Hello,
>
> For my current project I need to implement an index-time mechanism to
> detect (near) duplicate documents. The TextProfileSignature available
> out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright
> but does not use global collection statistics in deciding which terms
> will be used for calculating the signature.
> Most state-of-the-art hash-based duplication detection algorithms make
> use of this information to improve precision and recall (e.g.
>
> http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122
> )
>
> Is it possible to access collection statistics - especially IDF values
> for all non-discarded terms in the current document - from within an
> implementation of the Signature class?
>
> Kind regards,
>
> Thomas
>
> --
> DDI Thomas Heigl
> Software Engineer
> --------------------------------------------
> System One
> Gesellschaft für technologiegestützte
> Kommunikationsprozesse m.b.H.
> Stiftgasse 6/2/6
> thomas.heigl@systemone.at
> http://www.systemone.at
> Powered by Open-Xchange.com
>
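
To make the idea in the quoted question concrete, here is a minimal sketch of a signature that uses collection statistics to choose its terms. It is only an illustration under stated assumptions: it is not Solr's Signature API, not TextProfileSignature, and not necessarily the algorithm from the linked article. The function names (idf_table, idf_signature) and the IDF thresholds (low, high) are hypothetical, and the corpus is assumed to be already tokenized and held in memory.

```python
import hashlib
import math
from collections import Counter


def idf_table(corpus):
    """Map each term to log(N / df), where df is its document frequency.

    corpus is a list of documents, each a list of tokens (an assumption
    made for this sketch; a real index would supply these statistics).
    """
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def idf_signature(tokens, idf, low, high):
    """Hash the sorted set of terms whose IDF lies within [low, high].

    Terms missing from the table get an infinite IDF and are therefore
    dropped whenever high is finite. The thresholds are hypothetical
    tuning knobs, not values taken from any paper or from Solr.
    """
    kept = sorted(
        {t for t in tokens if low <= idf.get(t, float("inf")) <= high}
    )
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

With this sketch, two documents that differ only in very rare or very common terms produce the same signature, because those terms fall outside the chosen IDF band. The flip side is that unrelated documents can collide if they share the same few mid-IDF terms, so the band has to be tuned against the collection.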
