From: Erik Hatcher <erik@ehatchersolutions.com>
Subject: Re: Filtering out duplicate documents...
Date: Mon, 8 Mar 2004 16:54:22 -0500
To: "Lucene Users List" <lucene-user@jakarta.apache.org>

My impression is that the new term vector support should at least make 
this type of comparison feasible in some manner.  I'd be interested to 
see what you come up with if you give it a try.  You will need the 
latest CVS codebase.
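
As a rough illustration of what "comparing the terms for two documents"
could look like, here is a minimal sketch that scores two documents by
the cosine similarity of their term-frequency vectors.  It is plain
Java and deliberately does not call any Lucene API: the parallel
terms/frequencies arrays stand in for whatever a stored term vector
would hand back, and the class name, method names, sample terms, and
the 0.9 cut-off are all assumptions made for the example, not anything
from the Lucene codebase.

```java
import java.util.HashMap;
import java.util.Map;

public class TermVectorSimilarity {

    // Build a term -> frequency map from two parallel arrays.
    // (Assumed input shape; a real term vector may be exposed differently.)
    static Map<String, Integer> toMap(String[] terms, int[] freqs) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            map.put(terms[i], freqs[i]);
        }
        return map;
    }

    // Cosine similarity of two term-frequency vectors:
    // dot(a, b) / (|a| * |b|), or 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical helper: the threshold is a tuning knob, not a known-good value.
    static boolean isNearDuplicate(Map<String, Integer> a,
                                   Map<String, Integer> b,
                                   double threshold) {
        return cosine(a, b) >= threshold;
    }

    public static void main(String[] args) {
        // Made-up data: an article, its printer-friendly copy with one
        // extra boilerplate term, and an unrelated document.
        Map<String, Integer> article = toMap(
                new String[] {"lucene", "index", "duplicate"},
                new int[] {3, 2, 1});
        Map<String, Integer> printer = toMap(
                new String[] {"lucene", "index", "duplicate", "print"},
                new int[] {3, 2, 1, 1});
        Map<String, Integer> other = toMap(
                new String[] {"weather", "rain"},
                new int[] {2, 1});

        System.out.println("article vs printer: " + cosine(article, printer));
        System.out.println("article vs other:   " + cosine(article, other));
        System.out.println("near duplicate?     "
                + isNearDuplicate(article, printer, 0.9));
    }
}
```

Each comparison only touches the distinct terms of the two documents,
so it is much cheaper than an edit distance over the full text, but
comparing every pair of documents in an index is still quadratic, so
some way of narrowing the candidate pairs would be needed first.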

	Erik


On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:

> I'm looking for a way to filter out duplicate documents from an index 
> (either while indexing or after the fact).  It seems like there 
> should be an approach based on comparing the terms of two documents, 
> but I'm wondering if any other folks (e.g. Nutch) have come up with a 
> solution to this problem.
>
> Obviously you can compute the Levenshtein distance on the text, but 
> that is far too computationally intensive to scale.  So the goal is to 
> find something that would be workable in a production system.  For 
> example, a given NYT article and its printer-friendly version should 
> be deemed the same.
>
> -Mike
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
