lucene-java-user mailing list archives

From Petite Abeille <petite_abei...@me.com>
Subject Re: Bet you didn't know Lucene can...
Date Mon, 31 Oct 2011 20:42:28 GMT

On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> similarity-preserving hash function was calculated on each sentence, and the hash was
> added as a field. The property of the hash was that similar documents (sentences) would produce
> a similar hash, with only some bit-level perturbation. The challenge was to find a ranked
> list of possible duplicates with similar (not exact same) hashes, which in this case meant
> to find a ranked list of documents that have the smallest bit-level distance in their hashes
> from the query hash.
> 
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/
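
For readers who have not met the idea, here is a minimal sketch of a simhash in the style of the paper above, plus the "smallest bit-level distance" ranking described in the quoted text. Everything in it is an assumption made for illustration: the whitespace tokenizer, the unweighted tokens, the use of MD5 as the per-token hash, the 64-bit width, and the function names. It is not the SOLR-1918 implementation, and it ranks by brute force in memory rather than inside an index.

```python
import hashlib

BITS = 64  # fingerprint width (an assumption; other widths are possible)


def simhash(text):
    """Return a 64-bit similarity-preserving fingerprint of `text`."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Stable per-token hash: first 8 bytes of MD5 read as a 64-bit integer.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def rank_by_bit_distance(query, sentences):
    """Sort sentences by Hamming distance from the query's fingerprint."""
    q = simhash(query)
    return sorted(sentences, key=lambda s: hamming(q, simhash(s)))
```

Sentences that share most of their tokens receive mostly the same votes, so their fingerprints tend to differ in only a few bits, while unrelated sentences tend to differ in roughly half of them. Ranking by Hamming distance therefore surfaces near-duplicates first.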

