lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Bet you didn't know Lucene can...
Date Tue, 01 Nov 2011 00:32:26 GMT
On 31/10/2011 21:42, Petite Abeille wrote:
>
> On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:
>
>> similarity-preserving hash function was calculated on each sentence,
>> and the hash was added as a field. The property of the hash was that
>> similar documents (sentences) would produce a similar hash, with only
>> some bit-level perturbation. The challenge was to find a ranked list
>> of possible duplicates with similar (not exact same) hashes, which in
>> this case meant to find a ranked list of documents that have the
>> smallest bit-level distance in their hashes from the query hash.
>>
>> The solution is described in SOLR-1918 - Bit-wise scoring field type.
>
> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
>
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different 
application-specific hash.
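
To make "smallest bit-level distance" concrete, here is a minimal
brute-force sketch in plain Java. It is not the SOLR-1918 code and not
the hash used in that project; it assumes 64-bit hashes held in a long[]
and ranks them by Hamming distance (the number of differing bits) from a
query hash. The class and method names (HammingRank, distance, rank) are
made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: hypothetical names, brute-force scan, 64-bit hashes assumed.
class HammingRank {

    /** Number of bit positions in which two 64-bit hashes differ. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Returns indices into {@code hashes}, ordered from the smallest to the
     * largest bit-level distance to {@code query}. Ties keep index order,
     * because Arrays.sort on object arrays is stable.
     */
    static int[] rank(long query, long[] hashes) {
        Integer[] idx = new Integer[hashes.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        Arrays.sort(idx, Comparator.comparingInt((Integer i) -> distance(query, hashes[i])));
        int[] out = new int[idx.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = idx[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long query = 0b1011_0110L;
        long[] hashes = { 0b1011_0110L, 0b0100_1001L, 0b1011_0100L };
        // Prints document indices, nearest hash first.
        System.out.println(Arrays.toString(rank(query, hashes)));
    }
}
```

A linear scan like this only illustrates the ranking criterion; inside an
index the distance would have to be computed at scoring time, which is the
problem the field type in SOLR-1918 addresses.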


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



