lucene-java-user mailing list archives

From henok sahilu <>
Subject Re: Finding duplicate records from a result set
Date Wed, 16 Sep 2009 09:16:06 GMT
I did my thesis work on this problem, using legal documents. XML IR systems are very
susceptible to producing duplicate or near-duplicate content (not in concept, but in textual
content).
Here is what I did.
I tagged the content of each article in the legal documents with its status and its relationship
to other article contents.
Then I wrote a parser that reads these tags and indexes the contents therein,
and a re-ranking algorithm that works from the status information of the article contents
and their relationships.
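
A minimal sketch of how such tags might be read before indexing. The element and attribute names (article, id, status, replaces) are assumptions made for illustration; the actual tag set from the thesis is not shown in this message. Only the standard JDK DOM parser is used, and the indexing step itself is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// One tagged article. All field names are hypothetical.
final class TaggedArticle {
    final String id;
    final String status;    // e.g. "active" or "repealed" (assumed values)
    final String replaces;  // id of the article this one replaces, "" if none
    final String text;

    TaggedArticle(String id, String status, String replaces, String text) {
        this.id = id;
        this.status = status;
        this.replaces = replaces;
        this.text = text;
    }
}

final class ArticleTagParser {
    // Reads every <article> element and collects its tags and text content.
    static List<TaggedArticle> parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("article");
            List<TaggedArticle> articles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                // getAttribute returns "" when the attribute is absent.
                articles.add(new TaggedArticle(
                        e.getAttribute("id"),
                        e.getAttribute("status"),
                        e.getAttribute("replaces"),
                        e.getTextContent().trim()));
            }
            return articles;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse tagged XML", ex);
        }
    }
}
```

Each TaggedArticle could then be turned into one indexed document, with the status and relationship kept as stored fields so that they are available at re-ranking time.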
For example:
An article content that contains an active law is boosted, because it is the active law that
prevails.
Sometimes articles are replaced by other articles. In that case, rather than presenting
both of them (which can result in duplicates), I compare the new terms introduced by the replacement
with the query terms and boost the replacing article content. Or I compare the terms that
are used exclusively by the old article content with the query terms and boost the old article.
Repealed article contents are down-weighted by some factor rather than presented
along with their other versions.
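
The rules above can be sketched roughly as follows. This is an illustration only, not the thesis code: the status names, the boost and penalty factors, and the tie-breaking rule are all assumptions, and the hits are plain objects rather than Lucene results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical status values; the real tag values may differ.
enum ArticleStatus { ACTIVE, REPLACED, REPEALED }

final class ArticleHit {
    final String id;
    final ArticleStatus status;
    final float score; // score as returned by the search engine

    ArticleHit(String id, ArticleStatus status, float score) {
        this.id = id;
        this.status = status;
        this.score = score;
    }
}

final class StatusReranker {
    // Assumed factors, chosen only for illustration.
    static final float ACTIVE_BOOST = 1.5f;
    static final float REPEALED_PENALTY = 0.5f;

    // Boosts active articles and down-weights repealed ones.
    static float adjust(ArticleHit hit) {
        switch (hit.status) {
            case ACTIVE:   return hit.score * ACTIVE_BOOST;
            case REPEALED: return hit.score * REPEALED_PENALTY;
            default:       return hit.score;
        }
    }

    // Returns a new list sorted by adjusted score, highest first.
    static List<ArticleHit> rerank(List<ArticleHit> hits) {
        List<ArticleHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((ArticleHit h) -> adjust(h)).reversed());
        return sorted;
    }
}

final class VersionChooser {
    // Counts query terms that occur in `own` but not in `other`.
    static int exclusiveMatches(Set<String> query, Set<String> own, Set<String> other) {
        int count = 0;
        for (String term : query) {
            if (own.contains(term) && !other.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // True when the query matches more terms exclusive to the old article.
    // Ties go to the replacing article (an assumption of this sketch).
    static boolean preferOld(Set<String> query, Set<String> oldTerms, Set<String> newTerms) {
        return exclusiveMatches(query, oldTerms, newTerms)
                > exclusiveMatches(query, newTerms, oldTerms);
    }
}
```

For a replaced/replacing pair, preferOld decides which single version to present, and rerank then orders the remaining hits by their status-adjusted scores.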
This is what I did to reduce the number of duplicate or near-duplicate search results
shown to users. Otherwise the user can waste considerable time inspecting which results are duplicates and
which are not.
I can send you the abstract and some code (in Java) if you want.
I think this might help.

--- On Wed, 9/16/09, syedfa <> wrote:

From: syedfa <>
Subject: Finding duplicate records from a result set
Date: Wednesday, September 16, 2009, 1:52 AM

Dear Fellow Java/Lucene developers:

One annoyance that people have when searching for information online is the
occurrence of duplicate records (e.g. multiple sites that carry news feeds
from the SAME news source, like Reuters or the Associated Press, and do not
provide any additional pieces of information).  This becomes an issue for
the user, who would like to skip all the duplicates and
look through only the unique hits.  In the application that I am working
on, I have found that this is extremely common.  I have various books in XML
format that contain quotations, many of which are also listed in other
collections (i.e. the narrator and the text of the quotation are almost
exactly the same.  Because the books have been translated into English by
different authors, the quotations from each collection differ slightly from
one another.  The quotations are being reported by multiple sources).  What
I would like to do in my application, using Lucene, is to return a
set of results such that, if a user searches for a particular keyword (or
uses multiple keywords), the result set lists any quote that is
reported from multiple sources only once and, underneath that quote, simply
lists all the references from the other collections where it is found,
instead of listing the exact same quote in the result set multiple times.
For example, if John Doe said, "blah blah blah", which is found in the
sources A, B, and C, if a user searched for "blah blah blah", then I want
the result set to show:
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A, B, C
and NOT like the following:
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A
2. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: B
3. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: C
I would imagine that this is a known issue in information retrieval, and I
am wondering if you have been able to solve/address this issue in Java using
Lucene?  What would you advise?  
Thanks to everyone for your time and patience.
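
One possible way to get the grouped result set described above is to post-process the hits after the search: group hits that share a narrator and whose quote text is nearly the same, then merge their references. The sketch below is an assumption-laden illustration, not an established Lucene feature. The class names, the word-set Jaccard similarity, and the threshold are all hypothetical choices, and the hits are plain objects rather than Lucene documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// One search hit: who said it, what was said, and where it was found.
final class QuoteHit {
    final String narrator;
    final String text;
    final String source;

    QuoteHit(String narrator, String text, String source) {
        this.narrator = narrator;
        this.text = text;
        this.source = source;
    }
}

// One displayed result: a quote plus every reference that carries it.
final class QuoteGroup {
    final String narrator;
    final String text;                 // wording of the first hit in the group
    final Set<String> tokens;          // normalized words of that wording
    final Set<String> references = new LinkedHashSet<>();

    QuoteGroup(String narrator, String text, Set<String> tokens) {
        this.narrator = narrator;
        this.text = text;
        this.tokens = tokens;
    }
}

final class QuoteGrouper {
    // Lowercases the text, drops punctuation, and returns the set of words.
    static Set<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(cleaned.split(" ")));
    }

    // Jaccard similarity: shared words divided by all distinct words.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    // Greedy grouping: a hit joins the first group with the same narrator
    // and a similarity at or above the threshold, else starts a new group.
    static List<QuoteGroup> group(List<QuoteHit> hits, double threshold) {
        List<QuoteGroup> groups = new ArrayList<>();
        for (QuoteHit hit : hits) {
            Set<String> words = tokens(hit.text);
            QuoteGroup match = null;
            for (QuoteGroup g : groups) {
                if (g.narrator.equalsIgnoreCase(hit.narrator)
                        && jaccard(g.tokens, words) >= threshold) {
                    match = g;
                    break;
                }
            }
            if (match == null) {
                match = new QuoteGroup(hit.narrator, hit.text, words);
                groups.add(match);
            }
            match.references.add(hit.source);
        }
        return groups;
    }
}
```

With the three John Doe hits from the example, group(hits, 0.8) yields a single group whose references are A, B, and C. Word-set similarity ignores word order and repetition, so the threshold would need tuning on real translations, and the greedy pass compares each hit only with the first wording seen in each group.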