From: Beto Siless
Organization: tera-code
Date: Tue, 24 Oct 2006 11:08:15 -0300
To: java-user@lucene.apache.org
Subject: Re: near duplicates
Message-ID: <453E1E4F.1080300@tera-code.com.ar>

Hi Karl!

I'm interested in near-duplicate detection based on termFreqVectors. Right now I'm comparing every document with every other one (calculating the angle between their vectors)...
Is there a way to avoid that?

Thanks!

Beto

karl wettin wrote:
>
> On 17 Oct 2006 at 17:54, Find Me wrote:
>
>> How to eliminate near duplicates from the index?
>
> Oh, one more thing. You should probably look at the norms in order to
> avoid comparing all documents to each other.
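
One way to read the suggestion about norms is as a length-based prefilter: near duplicates tend to have almost the same length, so documents can be sorted by length and the angle computed only for neighbours of similar length. The following is a minimal sketch of that idea, not Lucene's API and not necessarily what Karl had in mind. The class NearDuplicateSketch, its methods, and the maxLengthRatio parameter are hypothetical names introduced for illustration, and the Euclidean length of a plain term-to-count map is used here as a stand-in for Lucene's norms. Note that the filter is a heuristic: the cosine is scale-invariant, so a document that repeats another one twice would be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: finds near-duplicate pairs among
// term-frequency maps (term -> count) without comparing every pair.
class NearDuplicateSketch {

    // Euclidean length of a term-frequency vector.
    static double length(Map<String, Integer> tf) {
        double sum = 0.0;
        for (int f : tf.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double la = length(a);
        double lb = length(b);
        return (la == 0.0 || lb == 0.0) ? 0.0 : dot / (la * lb);
    }

    // Returns index pairs {i, j} with i < j whose cosine is >= minCosine.
    // Documents are sorted by length; each one is compared only with the
    // longer documents whose length is at most maxLengthRatio times its own.
    static List<int[]> nearDuplicates(List<Map<String, Integer>> docs,
                                      double minCosine, double maxLengthRatio) {
        int n = docs.size();
        double[] len = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            len[i] = length(docs.get(i));
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> len[i]));

        List<int[]> pairs = new ArrayList<>();
        for (int x = 0; x < n; x++) {
            int i = order[x];
            for (int y = x + 1; y < n; y++) {
                int j = order[y];
                if (len[j] > len[i] * maxLengthRatio) {
                    break; // every later document is at least this long
                }
                if (cosine(docs.get(i), docs.get(j)) >= minCosine) {
                    pairs.add(new int[] {Math.min(i, j), Math.max(i, j)});
                }
            }
        }
        return pairs;
    }

    // Builds a term-frequency map from whitespace-separated text.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }
}
```

In a real index the maps would be filled from each document's stored term vector rather than from termFreqs, and both thresholds would need tuning on real data. The worst case is still quadratic when many documents share a similar length, so this only reduces the number of angle computations; it does not bound it.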