lucene-java-user mailing list archives

From "John Casey" <john.ca...@gmail.com>
Subject Re: near duplicates
Date Wed, 18 Oct 2006 23:50:27 GMT
On 10/18/06, Isabel Drost <idrost@htwm.de> wrote:
>
> Find Me wrote:
> > How to eliminate near duplicates from the index? Someone suggested
> > that I could look at the TermVectors and do a comparison to remove
> > the duplicates.
>
> As an alternative you could also have a look at the paper "Detecting
> Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
> Manasse, Marc Najork.
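
On the TermVectors suggestion: if the fields were indexed with term
vectors enabled (Field.TermVector.YES), one simple measure is cosine
similarity over the stored term frequencies. A rough sketch, assuming a
"contents" field (the class and field name here are just mine):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

    /** Cosine similarity between the stored term vectors of two docs. */
    public static double cosine(IndexReader reader, int docA, int docB)
            throws IOException {
        // getTermFreqVector() returns null if no vector was stored for the field.
        Map<String, Integer> a = toMap(reader.getTermFreqVector(docA, "contents"));
        Map<String, Integer> b = toMap(reader.getTermFreqVector(docB, "contents"));

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue().intValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb.intValue();
            }
        }
        for (Integer f : b.values()) {
            normB += (double) f.intValue() * f.intValue();
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static Map<String, Integer> toMap(TermFreqVector tfv) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            m.put(terms[i], Integer.valueOf(freqs[i]));
        }
        return m;
    }
}

Pairs scoring above some threshold (0.95, say) would be near-duplicate
candidates, though as Isabel notes below, plain term vectors ignore
word order.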


Another good reference would be Soumen Chakrabarti's book "Mining the
Web: Discovering Knowledge from Hypertext Data" (2003), in particular
the section on shingling and the elimination of near duplicates. I think
this works at the document level rather than at the term-vector level,
but it might be useful to prevent duplicate documents from being indexed
altogether.
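
In rough form, shingling slides a w-word window over the document,
hashes each window (a "shingle"), and compares two documents by the
overlap of their shingle sets. A quick sketch of the idea (class and
method names are mine, not from the book):

import java.util.HashSet;
import java.util.Set;

public class Shingler {

    /** Hash of every w-word shingle (window) of a token stream. */
    public static Set<Integer> shingles(String[] tokens, int w) {
        Set<Integer> result = new HashSet<Integer>();
        for (int i = 0; i + w <= tokens.length; i++) {
            StringBuffer sb = new StringBuffer();
            for (int j = i; j < i + w; j++) {
                sb.append(tokens[j]).append(' ');
            }
            result.add(Integer.valueOf(sb.toString().hashCode()));
        }
        return result;
    }

    /** Jaccard resemblance |A intersect B| / |A union B| of two shingle sets. */
    public static double resemblance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<Integer>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<Integer>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}

Before indexing a new document you could compute its shingle set and
skip the document if its resemblance to something already indexed is
above some cutoff (0.9 is just a guess). As far as I recall, the schemes
in the book and in the Fetterly et al. paper keep only a small sample of
the shingle hashes rather than comparing full sets, which is what makes
this practical at scale.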

> > One major problem with this is that the structure of the document is
> > no longer important. Are there any obvious pitfalls? For example:
> > Document A being a subset of Document B, but in no particular order.
>
> I think this case is pretty unlikely. But I am not sure whether you can
> detect near duplicates by only comparing term-document vectors. There
> might be problems with documents with slightly changed words, or words
> that were replaced with synonyms...
>
> However, if you want to keep some information on the word order, you
> might consider comparing n-gram document vectors. That is, each
> dimension in the vector represents not just one word but a sequence of
> 2, 3, 4, 5... words.



Would this involve something like a window of 2-5 words around a
particular term in a document?
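
If so, I'd picture building the vector by sliding the window one word at
a time across the whole document, something like this sketch (the names
are mine):

import java.util.HashMap;
import java.util.Map;

public class NGramVector {

    /**
     * Frequency vector whose dimensions are word n-grams for every n
     * from minN to maxN (e.g. 2 to 5), produced by sliding a window
     * one word at a time over the whole token stream.
     */
    public static Map<String, Integer> build(String[] tokens, int minN, int maxN) {
        Map<String, Integer> vector = new HashMap<String, Integer>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                StringBuffer gram = new StringBuffer();
                for (int j = i; j < i + n; j++) {
                    if (j > i) {
                        gram.append(' ');
                    }
                    gram.append(tokens[j]);
                }
                String key = gram.toString();
                Integer old = vector.get(key);
                vector.put(key, Integer.valueOf(old == null ? 1 : old.intValue() + 1));
            }
        }
        return vector;
    }
}

The resulting vectors could then be compared with the usual cosine
measure.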

> Cheers,
> Isabel
