lucene-solr-dev mailing list archives

From Chris Hostetter
Subject Re: Implementing near duplicate detection algorithm using IDF statistics
Date Wed, 31 Mar 2010 16:58:52 GMT

: Is it possible to access collection statistics - especially IDF values
: for all non-discarded terms in the current document - from within an
: implementation of the Signature class?

The Signature API only lets you compute a unique value from a pile of
Strings, but you could extend the SignatureUpdateProcessorFactory so that
it passes the Signature class only specific field values, chosen by their
IDF. The IDF statistics are available to the
SignatureUpdateProcessorFactory via the IndexReader, which you reach
through the SolrCore, which you reach through the SolrQueryRequest.
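
As a rough illustration of the filtering step, here is a minimal sketch in
plain Java with no Solr dependencies. The class name IdfFilterSketch, the
method names, and the minIdf threshold are all hypothetical. The IDF formula
(1 + ln(numDocs / (docFreq + 1))) is the classic Lucene-style one and is an
assumption here; in a real processor the document frequencies would come
from something like IndexReader.docFreq(Term) rather than a Map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IdfFilterSketch {

    // Assumed formula: classic Lucene-style IDF, 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Keeps only the terms whose IDF is at least minIdf.
    // Terms missing from docFreqs are treated as docFreq 0 (maximally rare).
    static List<String> selectRareTerms(List<String> terms,
                                        Map<String, Integer> docFreqs,
                                        int numDocs,
                                        double minIdf) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            int df = docFreqs.getOrDefault(term, 0);
            if (idf(df, numDocs) >= minIdf) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for a 1000-document index.
        Map<String, Integer> docFreqs = Map.of("the", 999, "solr", 40, "simhash", 2);
        List<String> terms = List.of("the", "solr", "simhash");
        // Only the surviving terms would be handed to the Signature.
        System.out.println(selectRareTerms(terms, docFreqs, 1000, 3.0));
    }
}
```

The terms that survive the filter are what the extended factory would pass
on to the Signature implementation.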

The complication you will run into with this approach is that the
UpdateProcessor pipeline runs before Analysis (it has to, since it might
be adding or removing fields from the documents). The String values haven't
been tokenized yet, so you can't easily look up the IDF of the terms in
the doc. You'd have to do your own preliminary Analysis of the raw
field values.
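
To make "preliminary Analysis" concrete, here is a minimal sketch in plain
Java. It is a stand-in, not the real thing: the class PreAnalysisSketch and
its tokenize method are hypothetical, and the lowercase-and-split rule is an
assumption. In practice you would want to run the same Analyzer that the
field's type uses at index time, otherwise the tokens you look up may not
match the terms in the index.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class PreAnalysisSketch {

    // Crude stand-in for an Analyzer: lowercases the raw value, splits on
    // runs of characters that are neither letters nor digits, and returns
    // the distinct tokens in order of first appearance.
    static Set<String> tokenize(String rawFieldValue) {
        Set<String> tokens = new LinkedHashSet<>();
        String lowered = rawFieldValue.toLowerCase(Locale.ROOT);
        for (String piece : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each distinct token is a candidate for an IDF lookup.
        String raw = "Near-duplicate detection, in Solr: near DUPLICATE!";
        for (String token : tokenize(raw)) {
            System.out.println(token);
        }
    }
}
```

Each token produced this way can then be looked up against the index
statistics before deciding what to feed into the signature.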

