Return-Path:
Delivered-To: apmail-lucene-solr-dev-archive@locus.apache.org
Received: (qmail 65771 invoked from network); 8 Oct 2008 00:03:42 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 8 Oct 2008 00:03:42 -0000
Received: (qmail 14644 invoked by uid 500); 8 Oct 2008 00:03:34 -0000
Delivered-To: apmail-lucene-solr-dev-archive@lucene.apache.org
Received: (qmail 14596 invoked by uid 500); 8 Oct 2008 00:03:34 -0000
Mailing-List: contact solr-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: solr-dev@lucene.apache.org
Delivered-To: mailing list solr-dev@lucene.apache.org
Received: (qmail 14576 invoked by uid 99); 8 Oct 2008 00:03:34 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Oct 2008 17:03:34 -0700
X-ASF-Spam-Status: No, hits=-1999.9 required=10.0 tests=ALL_TRUSTED,DNS_FROM_SECURITYSAGE
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 08 Oct 2008 00:02:39 +0000
Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 86E7F234C218 for ; Tue, 7 Oct 2008 17:02:44 -0700 (PDT)
Message-ID: <870958535.1223424164551.JavaMail.jira@brutus>
Date: Tue, 7 Oct 2008 17:02:44 -0700 (PDT)
From: "Mark Miller (JIRA)"
To: solr-dev@lucene.apache.org
Subject: [jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
In-Reply-To: <1661911285.1223088224170.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637719#action_12637719 ]

Mark Miller commented on SOLR-799:
----------------------------------

Thanks for the review, Andrzej. I've made the first two changes (I noted at the top of TextProfileSignature that it's 'borrowed' from Nutch, and I grabbed Hadoop's MD5Hash class and stripped its Hadoop dependencies), and I'm investigating change 3. I'll put up another patch in a couple of days.

- Mark

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.