lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From syedfa <>
Subject Re: Finding duplicate records from a result set
Date Wed, 16 Sep 2009 10:04:30 GMT

Thanks very much Henok for your reply.  I would be very much interested in
your thesis, and any code that you may provide.  Is your thesis published
online?  Is it in english?  Your approach seems very interesting, and I
would be very interested in looking at the details.  Some ideas I had were
using the cosine similarity approach, or perhaps n-grams, since the
quotations that I am comparing are not as big as an entire document/article.

I would be very happy to hear from you.  My email address is

Also, were you able to implement your solution online?  If you can provide
me with you a url, that would be great.

I look forward to hearing from you soon.


henok sahilu wrote:
> i have a thesis work which i have done. it was on lega documents. the XML
> IR systems are very susceptible for producing duplicate or near duplicate
> contents (not in concept, but in textual content ).
> here is what i did .
> i tag each article content in the legal documents, with their status, and
> their relationship with other article contents.
>  and then write a parer that will read this tags and index the contents
> therein. 
> and write a re-ranking algorithm that works based on the staus information
> of article contents and their relationship.
> for example 
> the one that contians an active law will be boosted, because it is the
> active law that prevails in matters.
> and some times article are replaced by other articles , in this case
> rather than presenting both them (which can result in duplicates ) i
> compare new terms used in the process of replacing with the query terms
> and boost the replacing article content. or i compare the terms that are
> exclusively used by the old article content with the query terms and boost
> the old article content. 
> the repealed article contents are downwwited by some factor rather than
> presenting their them along with their other version . 
> this is what i have done to reduce the number of duplicating or near
> duplicating search results to users.other wise the user can waste
> considerable time inspecting which is duplicate and which is not .
> i can give u the the abstract and come codes (in java) id u want 
> i think this might help 
> henok 
> --- On Wed, 9/16/09, syedfa <> wrote:
> From: syedfa <>
> Subject: Finding duplicate records from a result set
> To:
> Date: Wednesday, September 16, 2009, 1:52 AM
> Dear Fellow Java/Lucene developers:
> One annoyance that people have when searching for information online is
> the
> occurance of duplicate records (i.e. multiple sites that carry news feeds
> from the SAME news source like reuters or the associated press, and do not
> provide any additional pieces of information).  This becomes an issue for
> the user, as they would like to sift through all the duplicates and only
> search through only the unique hits.  In my application that I am working
> on, I realize that this is extremely common.  I have various books in xml
> format that contain quotations, of which many are also listed in other
> collections (i.e. the narrator, and the text of the quotation are almost
> exactly the same.  Because the books have been translated into english by
> different authors, the quotations from each collection differ slightly
> from
> one another.  The quotations are being reported by multiple sources). 
> What
> I would like to do in my application, using either Lucene, is to return a
> set of results, such that if a user searches for a particular keyword (or
> uses multiple keywords), then the result set should list any quote that is
> reported from multiple sources only once, and underneath that quote,
> simply
> list all the references from the other collections where it is found,
> instead of listing the exact same quote in the result set, multiple times.
> For example, if John Doe said, "blah blah blah", which is found in the
> sources A, B, and C, if a user searched for "blah blah blah", then I want
> the result set to show:
> 1. Narrator: John Doe
>     Quote: "blah blah blah"
>     Reference: A, B, C
> and NOT like the following:
> 1. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: A
> 2. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: B
> 3. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: C
> I would imagine that this is a known issue in information retrieval, and I
> am wondering if you have been able to solve/address this issue in Java
> using
> Lucene?  What would you advise?  
> Thanks to everyone for your time and patience.
> -- 
> View this message in context:
> Sent from the Lucene - Java Users mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message