lucene-java-user mailing list archives

From syedfa <>
Subject Finding duplicate records from a result set
Date Wed, 16 Sep 2009 08:52:31 GMT

Dear Fellow Java/Lucene developers:

One annoyance people face when searching for information online is the
occurrence of duplicate records (e.g. multiple sites that carry news feeds
from the SAME news source, like Reuters or the Associated Press, without
adding any information of their own).  This is a problem for the user, who
would like to skip the duplicates and look through only the unique hits.
In the application I am working on, this situation is extremely common.  I
have various books in XML format that contain quotations, many of which
also appear in other collections, because the same quotations are reported
by multiple sources.  The narrator and the text of a quotation are almost
exactly the same in each collection, but because the books have been
translated into English by different authors, the wording differs slightly
from one collection to another.

What I would like to do in my application, using Lucene, is the following:
when a user searches for a keyword (or several keywords), the result set
should list any quote that is reported by multiple sources only once and,
underneath that quote, list all the collections in which it is found,
instead of listing the same quote multiple times.  For example, if John Doe
said "blah blah blah", and the quote is found in sources A, B, and C, then
a search for "blah blah blah" should show:
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A, B, C
and NOT like the following:
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A
2. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: B
3. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: C
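
To make the desired grouping concrete, here is a minimal sketch in plain
Java of the behavior described above.  It is only an illustration under
stated assumptions, not a Lucene feature: it works on hits that have
already been retrieved, and the names QuoteGrouper, Hit, Group and the
similarity threshold are all hypothetical.  Two hits are treated as the
same quote when the narrator matches and the word sets of the two texts
overlap strongly (Jaccard similarity), which is one simple way to tolerate
small differences between translations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: collapses hits that repeat a quote.
class QuoteGrouper {

    /** One search hit: who said it, the text, and the collection it came from. */
    static final class Hit {
        final String narrator;
        final String quote;
        final String reference;

        Hit(String narrator, String quote, String reference) {
            this.narrator = narrator;
            this.quote = quote;
            this.reference = reference;
        }
    }

    /** One displayed result: a representative quote plus all of its references. */
    static final class Group {
        final String narrator;
        final String quote;          // text of the first hit that opened the group
        final Set<String> tokens;    // word set of that representative text
        final List<String> references = new ArrayList<>();

        Group(Hit first) {
            this.narrator = first.narrator;
            this.quote = first.quote;
            this.tokens = tokens(first.quote);
        }
    }

    /** Lowercases the text and splits it into a set of words, ignoring punctuation. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Jaccard similarity: intersection size divided by union size. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    /**
     * Greedy grouping: each hit joins the first existing group with the same
     * narrator and a similar enough quote, otherwise it opens a new group.
     * Input order is preserved, so ranking by score is kept.
     */
    static List<Group> group(List<Hit> hits, double threshold) {
        List<Group> groups = new ArrayList<>();
        for (Hit hit : hits) {
            Set<String> hitTokens = tokens(hit.quote);
            Group match = null;
            for (Group g : groups) {
                if (g.narrator.equalsIgnoreCase(hit.narrator)
                        && jaccard(g.tokens, hitTokens) >= threshold) {
                    match = g;
                    break;
                }
            }
            if (match == null) {
                match = new Group(hit);
                groups.add(match);
            }
            match.references.add(hit.reference);
        }
        return groups;
    }
}
```

Calling QuoteGrouper.group(hits, 0.8) on the three John Doe hits above
would give a single group whose references are A, B, and C.  The threshold
of 0.8 is an arbitrary starting point; it would need tuning against real
translations, and the pairwise comparison is only practical for the
modest number of hits shown on one results page.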
I would imagine that this is a known issue in information retrieval, and I
am wondering whether any of you have solved or addressed it in Java using
Lucene.  What would you advise?
Thanks to everyone for your time and patience.