The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature
Comment:
description of algorithm
New page:
TextProfileSignature calculates a fuzzy hash of textual fields for [[Deduplication]]. It
may be incorporated using a SignatureUpdateProcessorFactory definition, which accepts the
following parameters:
|| '''Name''' || '''Type''' || '''Description''' || '''Default value''' ||
|| `minTokenLen` || int || The minimum token length to consider || 2 ||
|| `quantRate` || float || When multiplied by the maximum token frequency, this determines count quantization || 0.01 ||
The signature calculation proceeds as follows:
=== Tokenization and normalization ===
* Tokens are contiguous alphanumeric characters
* Normalized to lowercase
* Discarded if shorter than `minTokenLen`
Tokens are then counted, tracking the frequency `maxFreq` of the most frequent token.
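The tokenization and counting step can be sketched as follows. This is a minimal illustration, not Solr's implementation: the function name `count_tokens` is hypothetical, and treating "alphanumeric" as ASCII letters and digits is an assumption (the actual class may accept a wider character set).

```python
import re
from collections import Counter

def count_tokens(text, min_token_len=2):
    """Return (counts, max_freq) for the tokens found in `text`."""
    # Assumption: "alphanumeric" is approximated by ASCII letters and digits.
    tokens = (t.lower() for t in re.findall(r"[A-Za-z0-9]+", text))
    # Discard tokens shorter than min_token_len, then count the rest.
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    # maxFreq: frequency of the most frequent token (0 for empty input).
    max_freq = max(counts.values(), default=0)
    return counts, max_freq
```

For example, `count_tokens("The cat and the hat: a Cat!")` counts `the` and `cat` twice each, drops the one-character token `a`, and reports a maximum frequency of 2.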
=== Count quantization ===
A value `quant` is calculated as follows:
|| `quant` := || 1 || if `maxFreq` <= 1 ||
|| || 2 || if round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||
Token frequencies are then rounded down to the nearest multiple of `quant`, and any token
occurring less than `quant` times is discarded.
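A sketch of the quantization step, under stated assumptions: the function name `quantize` is hypothetical, the input is a mapping of token to raw count, and round() is taken to mean rounding halves up (the page does not specify the rounding mode).

```python
def quantize(counts, max_freq, quant_rate=0.01):
    """Return (quant, profile); profile maps token -> quantized count."""
    # Assumption: round() means round-half-up, written here as int(x + 0.5).
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = {}
    for token, count in counts.items():
        rounded = (count // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:  # drops tokens occurring fewer than quant times
            profile[token] = rounded
    return quant, profile
```

With the default `quantRate` of 0.01, any document whose most frequent token occurs between 2 and 149 times gets `quant` = 2, so tokens that occur only once are discarded and odd counts are rounded down to the even number below.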
=== Hashing ===
The set of frequencies is transformed to a string as a space-delimited sequence of tokens
and their frequencies, in descending frequency order. This is then MD5-hashed.
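The hashing step might look like the sketch below. The function name `profile_signature` is hypothetical, and several details are assumptions rather than facts from this page: each token is followed by its count, ties in frequency are broken alphabetically, the string is encoded as UTF-8, and the digest is returned as hexadecimal. Signatures produced this way are therefore not guaranteed to match those computed by Solr.

```python
import hashlib

def profile_signature(profile):
    """Return an MD5 hex digest for a mapping of token -> quantized count."""
    # Descending frequency; alphabetical tie-break is an assumption for determinism.
    ordered = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    # Assumed layout: "token count token count ...", separated by single spaces.
    text = " ".join(f"{token} {count}" for token, count in ordered)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Because only quantized counts of the surviving tokens enter the string, two documents that differ in rare words or in small count variations can still produce the same signature, which is what makes the hash fuzzy.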
See also [[http://lucene.apache.org/solr/api4_0_0BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]
