lucene-java-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: Detecting duplicates
Date Sat, 05 Mar 2011 07:09:43 GMT
This is the near-duplicate detection problem. Many papers address it,
and methods such as simhash are commonly used.
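
As a rough illustration of the simhash idea, here is a minimal sketch in plain Java. It is not taken from Lucene or from any particular paper; the class name SimHashSketch, the whitespace tokenizer, and the use of the first 8 bytes of an MD5 digest as the per-token hash are all assumptions made for the example. Each token votes on each of the 64 bits, and the sign of the vote total becomes the fingerprint bit, so similar texts tend to get fingerprints that differ in few bits.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class SimHashSketch {

    // 64-bit hash of one token: the first 8 bytes of its MD5 digest (an assumption;
    // any well-mixed 64-bit hash would do).
    static long hash64(String token) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Unweighted simhash over lowercased, whitespace-separated tokens.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token on leading whitespace
            }
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of differing bits between two fingerprints.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Two documents would then be treated as near duplicates when hammingDistance(simhash(a), simhash(b)) falls below some small threshold. The right threshold depends on the data and has to be tuned. Very short texts, like the four-word examples in the question, give noisy fingerprints.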

2011/3/5 Mark <static.void.dev@gmail.com>

> Is there a way to detect duplicates (say, by using a unique hash of
> certain fields) and mark a document as a duplicate without removing it?
>
> Here is an example:
>
> Doc 1) This is my test
> Doc 2) This is my test
> Doc 3) Another test
> Doc 4) This is my test
>
> Docs 1 and 3 should be considered unique, whereas 2 and 4 should be marked
> as duplicates (of doc 1).
>
> Can this be easily accomplished?
>
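
For the exact-duplicate case in the quoted example, a hash-based lookup is enough. The following is a minimal sketch in plain Java rather than a Lucene feature; the class DuplicateMarker, the method markDuplicates, and the choice to normalize by trimming and lowercasing are assumptions for illustration. The normalized field text is used as a HashMap key (so the hashing is done by the map); for long fields a digest of the text could serve as the key instead.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class DuplicateMarker {

    // For each doc id (in the iteration order of the input map), returns the id of
    // the first document seen with the same normalized text, or null if the
    // document is the first of its kind. Nothing is removed.
    static Map<Integer, Integer> markDuplicates(Map<Integer, String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        Map<Integer, Integer> duplicateOf = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> entry : docs.entrySet()) {
            String key = entry.getValue().trim().toLowerCase(Locale.ROOT);
            // putIfAbsent returns null when the key is new, else the earlier doc id.
            Integer original = firstSeen.putIfAbsent(key, entry.getKey());
            duplicateOf.put(entry.getKey(), original);
        }
        return duplicateOf;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so "first seen" is well defined.
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "This is my test");
        docs.put(2, "This is my test");
        docs.put(3, "Another test");
        docs.put(4, "This is my test");
        System.out.println(markDuplicates(docs)); // prints {1=null, 2=1, 3=null, 4=1}
    }
}
```

In an index, the same idea would mean storing the key (or a flag such as "duplicate of doc 1") in an extra field at indexing time, so duplicates stay searchable but can be filtered or grouped at query time.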
