lucene-dev mailing list archives

From Michael Ji <fji...@yahoo.com>
Subject Re: Document Duplication for Multiple Segment Merge
Date Fri, 14 Oct 2005 17:26:14 GMT
hi Yonik:

Does that mean that when two documents with the same
MD5 content hash sit in two different segments,
IndexMerger.java will keep both of them?

When I look at the code of IndexSegment.java, it
handles MD5 deduplication by keeping the document
with the higher document ID.
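
A minimal sketch of that keep-the-higher-ID rule, to make the question concrete. This is not the actual Nutch code: the class `Md5DedupSketch`, the `Doc` holder, and the method `findDuplicatesToDelete` are hypothetical names, and it assumes the MD5 of each document's content is already available as a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Md5DedupSketch {

    /** Hypothetical stand-in for an indexed document: an ID plus the MD5 of its content. */
    static final class Doc {
        final int docId;
        final String md5;

        Doc(int docId, String md5) {
            this.docId = docId;
            this.md5 = md5;
        }
    }

    /**
     * Returns the IDs of documents to delete: for each MD5 value, every
     * document except the one with the highest document ID.
     */
    static List<Integer> findDuplicatesToDelete(List<Doc> docs) {
        // First pass: remember the highest doc ID seen for each MD5.
        Map<String, Integer> keeper = new HashMap<>();
        for (Doc d : docs) {
            keeper.merge(d.md5, d.docId, Integer::max);
        }
        // Second pass: anything that is not the keeper for its MD5 is a duplicate.
        List<Integer> toDelete = new ArrayList<>();
        for (Doc d : docs) {
            if (keeper.get(d.md5) != d.docId) {
                toDelete.add(d.docId);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc(0, "aaa"),   // same content as doc 2
            new Doc(1, "bbb"),
            new Doc(2, "aaa"));
        System.out.println(findDuplicatesToDelete(docs));
    }
}
```

In this sketch the rule only works across segments if the document IDs are comparable across them, which is the point of the question above.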

So, when refetching happens, the old segment should be
discarded entirely. And a strategy is needed so that
each segment corresponds to a fetchlist with the same
fetch interval. Is this how Nutch handles the
refetching case?


Michael Ji,

--- Yonik Seeley <yseeley@gmail.com> wrote:

> There is no concept in Lucene of document identity
> linked to any fields of a document.
> You need to handle removal of duplicates yourself.
> 
> -Yonik
> 
> 
> On 10/14/05, Michael Ji <fji_00@yahoo.com> wrote:
> >
> > hi,
> >
> > When Nutch's IndexMerger.java is called, the indexes
> > from multiple segment directories are merged to one
> > target directory.
> >
> > I wonder how Lucene deals with the case when identical
> > documents exist in two segments. Is the older
> > document (lower time stamp) deleted?
> >
> > thanks,
> >
> > Michael Ji,
> >
> >
> >
> > __________________________________
> > Yahoo! Music Unlimited
> > Access over 1 million songs. Try it free.
> > http://music.yahoo.com/unlimited/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> 
