From: Erik Hatcher
To: "Lucene Users List"
Subject: Re: Filtering out duplicate documents...
Date: Mon, 8 Mar 2004 16:54:22 -0500

My impression is the new term vector support should at least make this type of comparison feasible in some manner.
I'd be interested to see what you come up with if you give this a try. You will need the latest CVS codebase.

	Erik

On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:

> I'm looking for a way to filter out duplicate documents from an index
> (either while indexing, or after the fact). It seems like there
> should be an approach of comparing the terms for two documents, but
> I'm wondering if any other folks (e.g. Nutch) have come up with a
> solution to this problem.
>
> Obviously you can compute the Levenshtein distance on the text, but
> that is way too computationally intensive to scale. So the goal is to
> find something that would be workable in a production system. For
> example, a given NYT article and its printer-friendly version should
> be deemed to be the same.
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
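
For illustration only, here is a rough sketch of what "comparing the terms for two documents" could look like: cosine similarity over per-document term frequencies. Nothing in it is Lucene API. The class TermVectorSimilarity, its methods, the naive tokenizer, and the similarity threshold are all hypothetical, and plain maps stand in for whatever the term vector support would actually return for a document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: scores two documents by the
// cosine of the angle between their term-frequency vectors.
final class TermVectorSimilarity {

    // Builds a term -> frequency map with a naive lowercase/alphanumeric
    // tokenizer. A real index would use its analyzer's terms instead.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Cosine similarity in [0, 1]; returns 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // The threshold is an assumption; it would need tuning on real data.
    static boolean isLikelyDuplicate(String textA, String textB, double threshold) {
        return cosine(termFrequencies(textA), termFrequencies(textB)) >= threshold;
    }
}
```

A call such as TermVectorSimilarity.isLikelyDuplicate(articleText, printerFriendlyText, 0.7) would flag a pair whose terms mostly overlap. Word order is ignored, which suits the article versus printer-friendly case, but comparing every pair of documents this way is still quadratic in the number of documents, so some way of narrowing the candidate pairs would be needed first.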