lucene-java-user mailing list archives

From Mark
Subject Detecting duplicates
Date Sat, 05 Mar 2011 06:49:28 GMT
Is there a way to detect duplicates (say, by computing a hash over 
certain fields) and mark a document as a duplicate without removing 
it?

Here is an example:

Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test

Docs 1 and 3 should be considered unique, whereas docs 2 and 4 should be 
marked as duplicates of doc 1.

Can this be easily accomplished?
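
To make the idea concrete, here is a rough sketch of the hash-based approach described above, in plain Java with no Lucene calls. It assumes that two documents are duplicates only when their text is byte-for-byte identical, and that SHA-1 is an acceptable signature. The class and method names (DuplicateMarker, signature, markDuplicates) are made up for illustration. The first document seen with a given hash is treated as the original, and later ones are flagged rather than dropped.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class DuplicateMarker {

    // Hex-encoded SHA-1 of the text that defines a document's identity.
    // Assumption: exact text equality is what "duplicate" means here.
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // For each document, returns the 1-based number of the earlier document
    // it duplicates, or 0 if it is the first document with that signature.
    static int[] markDuplicates(List<String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        int[] duplicateOf = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            String sig = signature(docs.get(i));
            // putIfAbsent returns null when the signature is new,
            // otherwise the number of the document that owns it.
            Integer original = firstSeen.putIfAbsent(sig, i + 1);
            duplicateOf[i] = (original == null) ? 0 : original;
        }
        return duplicateOf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is my test",
                "This is my test",
                "Another test",
                "This is my test");
        int[] duplicateOf = markDuplicates(docs);
        for (int i = 0; i < duplicateOf.length; i++) {
            String status = duplicateOf[i] == 0
                    ? "unique"
                    : "duplicate of doc " + duplicateOf[i];
            System.out.println("Doc " + (i + 1) + ": " + status);
        }
    }
}
```

With the four example documents, this reports docs 1 and 3 as unique and docs 2 and 4 as duplicates of doc 1. In an index, the signature and the duplicate flag could presumably be stored as extra fields on each document at indexing time, so duplicates stay searchable but can be filtered out when needed. The in-memory map here only stands in for whatever lookup would be used against a real index.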
