lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "TextProfileSignature" by JoelNothman
Date Tue, 30 Oct 2012 09:37:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature?action=diff&rev1=2&rev2=3

Comment:
some analysis

  
  See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]
  
+ == Implications and limitations ==
+ 
+ Although the signature is meant to match texts approximately, two documents still match only if their hashes are identical. It can therefore fail to match documents that differ by exactly one word, if that word's frequency changes from `k * quant - 1` to `k * quant` and so lands in the next quantization bucket.
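
The boundary effect can be illustrated with a minimal sketch. It assumes that each word frequency is rounded down to a multiple of `quant` before hashing; the function name `quantize` and the chosen counts are illustrative, not part of Solr.

```python
def quantize(count: int, quant: int) -> int:
    """Round a word frequency down to the nearest multiple of quant."""
    return (count // quant) * quant


quant = 2  # assumed value for a short text with at least one repeated word

# k = 2: one extra occurrence moves the word from k * quant - 1 to k * quant.
print(quantize(3, quant))  # 2
print(quantize(4, quant))  # 4
```

Because the quantized frequency feeds into the hash, the two documents end up with different signatures even though they differ by a single word.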
+ 
+ Words appearing once are ignored unless the text consists only of words appearing once. Hence, "the cat sat on a mat" hashes differently from "the cat sat on the mat": in the second text "the" occurs twice, so every word occurring once is dropped from the profile.
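
The two example sentences can be traced with a simplified sketch of the profile computation. This is not Solr's code: the names `profile`, `quant_rate` and `min_token_len` are illustrative, and the tokenization rule (lowercased runs of letters and digits, longer than two characters), the half-up rounding and the tie-breaking sort order are assumptions made for the example.

```python
import re
from collections import Counter


def profile(text, quant_rate=0.01, min_token_len=2):
    """Return the (word, quantized frequency) pairs that would be hashed."""
    # Assumption: tokens are lowercased runs of letters/digits, and only
    # tokens longer than min_token_len are counted.
    words = [w for w in re.findall(r"[^\W_]+", text.lower())
             if len(w) > min_token_len]
    counts = Counter(words)
    if not counts:
        return []
    max_freq = max(counts.values())

    # Assumption: quant is max_freq * quant_rate rounded half up, with a
    # floor of 2 as soon as any word repeats, and 1 otherwise.
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop words
    # whose quantized frequency falls below quant.
    kept = [(w, (c // quant) * quant) for w, c in counts.items() if c >= quant]
    # Sort by descending frequency; ties broken alphabetically so the
    # sketch is deterministic.
    return sorted(kept, key=lambda wc: (-wc[1], wc[0]))


print(profile("the cat sat on a mat"))
# [('cat', 1), ('mat', 1), ('sat', 1), ('the', 1)]
print(profile("the cat sat on the mat"))
# [('the', 2)]
```

Under these assumptions the first sentence keeps all four counted words, while the second keeps only "the", so the hashed profiles have nothing in common beyond that one token.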
+ 
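+ For the default `quantRate` (0.01), quant will exceed 2 only once `maxFreq * quantRate` rounds to at least 3, that is, when the most frequent word occurs about 250 times or more. Below that, quant is effectively 1 or 2.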
+ 
+ These properties all suggest that TextProfileSignature is brittle for short texts.
+ 
+ TextProfileSignature operates on raw text, without the filtering provided by Analyzers. Hence it does not strip HTML markup, normalize diacritics, stem words, or weight tokens by their relative importance.
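
A small sketch shows what raw tokenization lets through. The helper `raw_tokens` is hypothetical and assumes the same rule as above: lowercased runs of letters and digits, keeping only tokens longer than two characters.

```python
import re


def raw_tokens(text, min_token_len=2):
    """Split raw text into lowercased alphanumeric tokens, with no analysis."""
    # Assumption: only tokens longer than min_token_len are kept.
    return [w for w in re.findall(r"[^\W_]+", text.lower())
            if len(w) > min_token_len]


# Tag names survive as ordinary tokens, and the accented word is kept as-is.
print(raw_tokens("<div>R\u00e9sum\u00e9 writing</div>"))
# ['div', 'résumé', 'writing', 'div']
print(raw_tokens("resume writing"))
# ['resume', 'writing']
```

Here the markup contributes a repeated `div` token, and "résumé" and "resume" count as different words, so the two inputs would not share a profile.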
+ 
