Date: Tue, 17 Oct 2006 10:54:07 -0500
From: "Find Me"
To: 
java-user@lucene.apache.org, nutch-user@lucene.apache.org
Subject: near duplicates

How can I eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer taken into account. Are there any obvious pitfalls? For example, Document A could be a subset of Document B, with its terms in no particular order.

Nutch's DeleteDuplicates class is useful only when the documents are identical with respect to either the URL or the content.
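
A minimal sketch of the term-vector comparison idea, in plain Java. It assumes the term frequencies have already been read out of the index into term -> frequency maps; the class NearDuplicates and its methods are hypothetical names for illustration, not part of Lucene or Nutch. Cosine similarity ignores word order, which shows how document structure is lost, and an asymmetric containment score is one possible way to flag the case where Document A is a subset of Document B.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; not a Lucene or Nutch API.
public class NearDuplicates {

    /** Builds a term -> frequency map from whitespace-separated text.
     *  This stands in for a term vector that would normally come from the index. */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity of two term-frequency maps; word order plays no role. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double fa = e.getValue();
            normA += fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Fraction of a's term occurrences that also occur in b.
     *  A value of 1.0 means a is contained in b as a bag of terms. */
    public static double containment(Map<String, Integer> a, Map<String, Integer> b) {
        long total = 0, shared = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            total += e.getValue();
            shared += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) shared / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> docA = termFrequencies("lucene index near duplicates");
        Map<String, Integer> docB =
            termFrequencies("how to remove near duplicates from a lucene index");
        Map<String, Integer> shuffledA = termFrequencies("duplicates near index lucene");

        // Same terms in a different order are indistinguishable from the original.
        System.out.println("cosine(A, shuffled A) = " + cosine(docA, shuffledA));
        // A is a subset of B: moderate cosine, but full containment in one direction.
        System.out.println("cosine(A, B)          = " + cosine(docA, docB));
        System.out.println("containment(A in B)   = " + containment(docA, docB));
        System.out.println("containment(B in A)   = " + containment(docB, docA));
    }
}
```

Any cut-off for treating two documents as near duplicates (for either score) would have to be tuned on real data. Comparing every pair of documents this way is quadratic in the number of documents, so on a large index some form of pre-grouping would probably be needed first.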