lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
Date Wed, 08 Oct 2008 00:02:44 GMT

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637719#action_12637719
] 

Mark Miller commented on SOLR-799:
----------------------------------

Thanks for the review Andrzej. I've made the first two changes (I put at the top of TextProfileSignature
that its 'borrowed' from Nutch and grabbed Hadoops MD5Hash class and stripped its Hadoop dependencies)
and I'm investigating change 3. I'll put up another patch in a couple days.

- Mark

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking as well
as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message