lucene-dev mailing list archives

From: <apa...@lucene.com>
Subject: Re: subclassing of IndexReader
Date: Tue, 28 Oct 2003 11:51:43 GMT
From Christoph Goller <goller@detego-software.de> on 25 Oct 2003:
> I reviewed your changes for subclassing of IndexReader

Thank you very much!  This makes me more comfortable with the changes.

> *)SegmentMerger.merge() now delivers the number of documents. Therefore,
> they don't have to be counted in IndexWriter.mergeSegments any longer.
> I changed this and all unit tests still work, including TestIndexWriter,
> which tests exactly this number. So I think this small change is ok
> and I will commit it.

Sounds good to me.

> *)I am just curious. What is IndexReader.undeleteAll needed for?

In Nutch we have a rotating set of indexes.  For example, we might create a new index every
day.  Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g.,
every day merge (or search w/o merging) the most recent 30 indexes.  So far so good.  But
many pages are clones of other pages: different urls with the same content.  So, each time
we deploy a new set of indexes we first need to perform duplicate detection to make sure that,
for each unique content, only a single url is present: the one with the highest link analysis
score.  I implement this by first calling undeleteAll(), then performing the global duplicate
detection and deleting duplicates from the index that contains them.  Does this make sense?
Each day duplicate detection must be repeated when a new index is added, but first all of the
previously detected duplicates must be cleared.
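
To make the workflow concrete, here is a rough sketch of that daily pass against the public
IndexReader API.  The index paths and the findDuplicates() helper are hypothetical (the real
detection logic lives in Nutch, not Lucene), so read this as an illustration rather than our
actual code:

  import org.apache.lucene.index.IndexReader;

  // Hypothetical sketch of the daily pass described above.  findDuplicates()
  // stands in for Nutch's global duplicate detection; it is not a real API.
  public class DailyDedup {
    public static void main(String[] args) throws Exception {
      String[] indexDirs = args;  // e.g. the 30 most recent daily indexes

      // 1. Clear the deletions left by yesterday's duplicate-detection pass.
      for (int i = 0; i < indexDirs.length; i++) {
        IndexReader reader = IndexReader.open(indexDirs[i]);
        reader.undeleteAll();
        reader.close();
      }

      // 2. Re-run global duplicate detection and delete every url whose
      //    content is shared with a higher-scoring url.
      for (int i = 0; i < indexDirs.length; i++) {
        IndexReader reader = IndexReader.open(indexDirs[i]);
        int[] dups = findDuplicates(reader);  // hypothetical helper
        for (int j = 0; j < dups.length; j++)
          reader.delete(dups[j]);
        reader.close();
      }
    }

    // Placeholder for the detection step: would return the doc numbers of
    // pages whose content duplicates a higher-scoring page elsewhere.
    static int[] findDuplicates(IndexReader reader) {
      return new int[0];
    }
  }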

> *)SegmentsReader.undeleteAll does not set hasDeletions to false.
> I think this is a bug. Could you check please.

That does indeed sound like a bug.  I am on the road this week, reading email on a borrowed
machine, so I cannot check it right now.  Thanks for catching this!
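
If it is just a missing reset, the fix should be tiny.  Something along these lines (the
class and field names below are my guesses for illustration, not the actual SegmentsReader
source, which I can't look at from here):

  import java.io.IOException;

  // Toy illustration of the missing reset; SegmentsReader itself is a
  // package-private Lucene class, so all names here are assumptions.
  class SegmentsReaderSketch {
    interface SubReader { void undeleteAll() throws IOException; }

    private final SubReader[] readers;  // one reader per segment
    private boolean hasDeletions;

    SegmentsReaderSketch(SubReader[] readers, boolean hasDeletions) {
      this.readers = readers;
      this.hasDeletions = hasDeletions;
    }

    void undeleteAll() throws IOException {
      for (int i = 0; i < readers.length; i++)
        readers[i].undeleteAll();
      hasDeletions = false;  // the reset the current code is missing
    }
  }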

> *)The optimized implementation of SegmentTermDocs.seek(TermEnum enum)
> is essential in order to avoid an unnecessary seek for the termInfo in
> SegmentMerger.appendPostings(...).

You really did review this well!  That was the only tricky part of this change, and it was
required to make it perform well.  I'm impressed that you noticed it.
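
For anyone else following the thread, the shape of the optimization is roughly the following
(schematic only; SegmentTermDocs, SegmentTermEnum and TermInfo are package-private internals,
so all the names below are stand-ins): when seek() is handed the segment's own term
enumerator, it can reuse the TermInfo the enumerator already holds instead of repeating the
term-dictionary lookup, which SegmentMerger.appendPostings would otherwise trigger once per
term per segment during a merge.

  import java.io.IOException;

  // Schematic sketch; every name here is an assumption, not the committed code.
  class SeekSketch {
    interface TermInfoLike { }             // freq/prox pointers, doc freq
    interface SegmentEnumLike {
      TermInfoLike termInfo();             // info for its current term
    }
    interface Dictionary {
      TermInfoLike lookup(String term);    // term-dictionary lookup (a seek)
    }

    private final Dictionary dictionary;

    SeekSketch(Dictionary dictionary) { this.dictionary = dictionary; }

    // Slow path: every call repeats the term-dictionary lookup.
    void seek(String term) throws IOException {
      position(dictionary.lookup(term));
    }

    // Fast path: the enumerator is already positioned on the term, so its
    // cached TermInfo can be reused without touching the dictionary again.
    void seek(SegmentEnumLike enumerator) throws IOException {
      position(enumerator.termInfo());
    }

    private void position(TermInfoLike info) { /* jump straight to the postings */ }
  }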

> The problem I see is that
> SegmentTermDocs.seek(TermEnum enum) is public and there is no check to
> ensure that enum is from the same segment as the SegmentTermDocs. I think
> such a check should be added. If you agree, I can do that.

That sounds like a good idea.  I am not good at error checking...
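
The check could be as simple as adding something like this to the sketch above: only take
the fast path when the enumerator demonstrably belongs to the same segment, and otherwise
fall back to (or fail on) the ordinary lookup.  Again, the names are illustrative guesses:

  // Candidate addition to the SeekSketch class above.  How a "same segment"
  // identity is obtained is an assumption; the point is only the guard itself.
  void seekChecked(SegmentEnumLike enumerator, Object enumeratorSegment,
                   Object thisSegment) throws IOException {
    if (enumeratorSegment != thisSegment)
      throw new IllegalArgumentException("TermEnum is not from this segment");
    position(enumerator.termInfo());  // now safe to reuse the cached TermInfo
  }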

Thanks again for your detailed review.  The fixes you suggest all sound good to me.

Doug


