lucene-solr-user mailing list archives

From Nikola Smolenski <smolen...@unilib.rs>
Subject Re: Grouping by simhash signature
Date Thu, 03 Dec 2015 10:12:01 GMT
On Wed, Dec 2, 2015 at 9:00 PM, Nickolay41189 <Klin892006@yandex.ru> wrote:
> I am trying to implement near-duplicate (NearDup) detection with the SimHash
> <https://moz.com/devblog/near-duplicate-detection/> algorithm in Solr.
> Let's say:
> 1) each document has a field /simhash_signature/ that stores a sequence of
> bits;
> 2) to be considered NearDups, two documents must differ in at most 2 bits
> of /simhash_signature/.
>
>
> *My question:*
> How can I group NearDups by /simhash_signature/?
>
> *Examples:*
>   Input:
>     Doc A = 0001000
>     Doc B = 1000000
>     Doc C = 1111111
>     Doc D = 0101000
>   Output:
>     A -> {B, D}
>     B -> {A}
>     C -> {}
>     D -> {A}

I'm not sure if this is the best solution (or, indeed, if it is possible
at all), but maybe you could store the bit fields as strings, then use
the strdist function to compute the Levenshtein distance between the
strings and group by that.
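
One caveat: the criterion in the question (at most 2 differing bits) is a
Hamming distance, and Levenshtein distance on equal-length strings can be
smaller than that, since an insertion plus a deletion acts like a shift. For
example, 0101010 and 1010101 differ in all 7 bits but are only 2 edits apart.
As a rough illustration of the grouping being asked for, here is a minimal
sketch in plain Python, done outside Solr. The names hamming and
near_dup_groups are made up for this example, and the pairwise loop is
quadratic, so it is only meant for small inputs.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_dup_groups(signatures: dict, max_bits: int = 2) -> dict:
    """Map each document id to the set of ids within max_bits of it.

    Hypothetical helper for illustration only; it compares every pair,
    so the cost grows quadratically with the number of documents.
    """
    groups = {doc: set() for doc in signatures}
    for (d1, s1), (d2, s2) in combinations(signatures.items(), 2):
        if hamming(s1, s2) <= max_bits:
            groups[d1].add(d2)
            groups[d2].add(d1)
    return groups


# The signatures from the example in the question.
signatures = {
    "A": "0001000",
    "B": "1000000",
    "C": "1111111",
    "D": "0101000",
}
groups = near_dup_groups(signatures)
for doc in sorted(groups):
    print(doc, "->", sorted(groups[doc]))
```

With the four signatures from the question this gives the same grouping as
the expected output there: A is paired with B and D, B and D are each paired
only with A, and C has no near duplicates.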

-- 
Nikola Smolenski

University of Belgrade
University library ''Svetozar Markovic''
