lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "TextProfileSignature" by JoelNothman
Date Tue, 30 Oct 2012 07:16:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature

Comment:
description of algorithm

New page:
TextProfileSignature calculates a fuzzy hash of textual fields for [[Deduplication]]. It
can be used through a SignatureUpdateProcessorFactory definition and accepts the following
parameters:

|| Name || Type || Description || Default value ||
|| `minTokenLen` || int || The minimum token length to consider || 2 ||
|| `quantRate` || float || Multiplied by the maximum token frequency to give the count quantization step `quant` || 0.01 ||

The signature calculation proceeds as follows:

=== Tokenization and normalization ===

* Tokens are maximal runs of contiguous alphanumeric characters
* Each token is normalized to lowercase
* Tokens shorter than `minTokenLen` are discarded

Tokens are then counted, tracking the frequency `maxFreq` of the most frequent token.
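
As a rough illustration of this step, the Python sketch below tokenizes a string and counts the surviving tokens. The names `tokenize` and `count_tokens` are hypothetical, and Python's `str.isalnum` is assumed to be an acceptable stand-in for the character test used by the real implementation, so results may differ on unusual Unicode input.

```python
from collections import Counter


def tokenize(text, min_token_len=2):
    """Split text into lowercase alphanumeric runs of at least min_token_len chars."""
    tokens = []
    current = []
    for ch in text:
        if ch.isalnum():
            # Assumption: str.isalnum approximates "alphanumeric" as used on this page.
            current.append(ch.lower())
        else:
            if len(current) >= min_token_len:
                tokens.append("".join(current))
            current = []
    # Flush the final run if the text ends inside a token.
    if len(current) >= min_token_len:
        tokens.append("".join(current))
    return tokens


def count_tokens(text, min_token_len=2):
    """Return (token -> frequency, frequency of the most frequent token)."""
    counts = Counter(tokenize(text, min_token_len))
    max_freq = max(counts.values(), default=0)
    return counts, max_freq
```

With the default `minTokenLen` of 2, single-character tokens such as "a" never reach the counting stage.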

=== Count quantization ===

A value `quant` is calculated as follows:

|| || 1 || if `maxFreq` <= 1 ||
||`quant` := || 2 || if `maxFreq` > 1 and round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||

Each token frequency is then rounded down to the nearest multiple of `quant`, and any token
occurring fewer than `quant` times is discarded.
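
The quantization rule can be sketched in Python as follows. The names `compute_quant` and `quantize` are hypothetical, and `round` is assumed here to mean rounding half up, which is why the sketch avoids Python's built-in `round` (it rounds halves to the nearest even integer).

```python
import math


def compute_quant(max_freq, quant_rate=0.01):
    """Derive the quantization step from the highest token frequency."""
    if max_freq <= 1:
        return 1
    # Assumption: round() on this page means round-half-up.
    rounded = math.floor(max_freq * quant_rate + 0.5)
    return max(rounded, 2)


def quantize(counts, quant):
    """Round each frequency down to a multiple of quant and drop tokens below quant."""
    result = {}
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            result[token] = quantized
    return result
```

With the default `quantRate` of 0.01, `quant` stays at 2 for any document whose most frequent token occurs between 2 and roughly 250 times. In such documents, tokens that occur only once are dropped and odd frequencies are rounded down to even ones.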

=== Hashing ===

The remaining tokens and their quantized frequencies are serialized to a string as a
space-delimited sequence, in descending frequency order. This string is then MD5-hashed.
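
A minimal Python sketch of this final step is shown below. The names `profile_string` and `signature` are hypothetical. The ordering of tokens with equal frequency, the UTF-8 encoding, and the hexadecimal output are all assumptions that this page does not specify, so the sketch is not expected to reproduce the exact signature Solr computes.

```python
import hashlib


def profile_string(quantized):
    """Serialize token/frequency pairs in descending frequency order."""
    # Assumption: ties keep the dict's insertion order (sorted() is stable).
    ordered = sorted(quantized.items(), key=lambda item: -item[1])
    return " ".join(f"{token} {freq}" for token, freq in ordered)


def signature(quantized):
    """MD5-hash the serialized profile and return it as a hex string."""
    # Assumption: the profile string is encoded as UTF-8 before hashing.
    data = profile_string(quantized).encode("utf-8")
    return hashlib.md5(data).hexdigest()
```

Because only quantized frequencies of the more frequent tokens enter the string, two texts that differ in a few rare words can produce the same profile string and therefore the same hash.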

See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]
