lucene-dev mailing list archives

From <>
Subject Re: subclassing of IndexReader
Date Tue, 28 Oct 2003 11:51:43 GMT
From Christoph Goller <> on 25 Oct 2003:
> I reviewed your changes for subclassing of IndexReader

Thank you very much!  This makes me more comfortable with the changes.

> *)SegmentMerger.merge() now delivers the number of documents. Therefore,
> they don't have to be counted in IndexWriter.mergeSegments any longer.
> I changed this and all unit tests still work including TestIndexWriter
> which tests exactly this number. So I think this small change is ok
> and I will commit it.

Sounds good to me.

> *)I am just curious. What is IndexReader.undeleteAll needed for?

In Nutch we have a rotating set of indexes.  For example, we might create a new index every
day.  Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g.,
every day merge (or search w/o merging) the most recent 30 indexes.  So far so good.  But
many pages are clones of other pages: different URLs with the same content.  So, each time
we deploy a new set of indexes we first need to perform duplicate detection to make sure that,
for each unique content, only a single URL is present: the one with the highest link analysis
score.  I implement this by first calling undeleteAll(), then performing the global duplicate
detection, deleting duplicates from their index.  Does this make sense?  Duplicate detection
must be repeated each day when a new index is added, but first all of the previously detected
duplicates must be cleared.
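
A minimal sketch of that daily cycle, written against plain in-memory collections rather than the Lucene API.  The Doc and Index classes, the contentHash and score fields, and the dedup method are all hypothetical names chosen for illustration; only the idea of calling undeleteAll() and then deleting duplicates globally comes from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; this is not Lucene or Nutch code.
class Doc {
    final String url;
    final String contentHash; // assumed: equal hash means duplicate content
    final double score;       // assumed: link analysis score
    boolean deleted = false;

    Doc(String url, String contentHash, double score) {
        this.url = url;
        this.contentHash = contentHash;
        this.score = score;
    }
}

class Index {
    final List<Doc> docs = new ArrayList<>();

    Index add(String url, String contentHash, double score) {
        docs.add(new Doc(url, contentHash, score));
        return this;
    }

    // Clears every deletion mark in this index.
    void undeleteAll() {
        for (Doc d : docs) d.deleted = false;
    }
}

class DedupSketch {
    // Re-runs global duplicate detection over the currently deployed indexes.
    static void dedup(List<Index> indexes) {
        // Step 1: forget the duplicates found by earlier runs.
        for (Index ix : indexes) ix.undeleteAll();

        // Step 2: keep only the highest-scoring doc per content hash.
        Map<String, Doc> best = new HashMap<>();
        for (Index ix : indexes) {
            for (Doc d : ix.docs) {
                Doc kept = best.get(d.contentHash);
                if (kept == null) {
                    best.put(d.contentHash, d);
                } else if (d.score > kept.score) {
                    kept.deleted = true;          // previous winner loses
                    best.put(d.contentHash, d);
                } else {
                    d.deleted = true;             // current doc loses
                }
            }
        }
    }
}
```

The first step matters when the set rotates: if the index holding yesterday's winning URL drops out of the set, a copy that was deleted as a duplicate has to become visible again before the new winner is chosen.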

> *)SegmentsReader.undeleteAll does not set hasDeletions to false.
> I think this is a bug. Could you check please.

It indeed sounds like a bug.  I am on the road this week, reading email on a borrowed machine,
and cannot check this right now.  Thanks for catching this!
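
To illustrate the kind of bug being reported, here is a small model of a reader that wraps several sub-readers and caches a hasDeletions flag.  The SubReader and MultiReaderSketch classes are hypothetical and do not reproduce the actual SegmentsReader source; the point is only that undeleteAll() has to reset the cached flag as well as the per-segment deletions.

```java
import java.util.Arrays;

// Hypothetical model of a single segment's deletion state.
class SubReader {
    private final boolean[] deleted;

    SubReader(int numDocs) { deleted = new boolean[numDocs]; }

    void delete(int n) { deleted[n] = true; }

    void undeleteAll() { Arrays.fill(deleted, false); }

    boolean isDeleted(int n) { return deleted[n]; }
}

// Hypothetical composite reader that caches whether any doc is deleted.
class MultiReaderSketch {
    private final SubReader[] readers;
    private boolean hasDeletions = false; // cached summary across sub-readers

    MultiReaderSketch(SubReader... readers) { this.readers = readers; }

    void delete(int reader, int doc) {
        readers[reader].delete(doc);
        hasDeletions = true;
    }

    void undeleteAll() {
        for (SubReader r : readers) r.undeleteAll();
        hasDeletions = false; // without this reset the flag stays stale
    }

    boolean hasDeletions() { return hasDeletions; }
}
```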

> *)The optimized implementation of enum)
> is essential in order to avoid unnecessary seek for termInfo in
> SegmentMerger.appendPostings(...).

You really did review this well!  That was the only tricky thing about this change, required
to make it perform well.  I'm impressed that you noticed it.

> The problem I see is that
> enum) is public and there is no test to
> assure that enum is from the same segment as SegmentTermDocs. I think
> such a test should be added. If you agree, I can do that.

That sounds like a good idea.  I am not good at error checking...
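
A rough sketch of what such a check could look like, using hypothetical classes rather than the real Lucene ones.  TermEnumSketch, TermDocsSketch, the segment name field, and the cached postingsPos offset are all assumptions made for illustration; the only thing taken from the discussion above is that the fast path should refuse an enum that belongs to a different segment.

```java
// Hypothetical model; names and fields are illustrative, not Lucene source.
class TermEnumSketch {
    final String segment;   // assumed: name of the segment this enum reads
    final long postingsPos; // assumed: postings offset cached for the current term

    TermEnumSketch(String segment, long postingsPos) {
        this.segment = segment;
        this.postingsPos = postingsPos;
    }
}

class TermDocsSketch {
    private final String segment;
    private long position = -1; // -1 means not positioned yet

    TermDocsSketch(String segment) { this.segment = segment; }

    // Fast path: reuse the offset the enum already holds, but only
    // after confirming the enum belongs to the same segment.
    void seek(TermEnumSketch termEnum) {
        if (!segment.equals(termEnum.segment)) {
            throw new IllegalArgumentException(
                "enum is from segment " + termEnum.segment + ", expected " + segment);
        }
        position = termEnum.postingsPos;
    }

    long position() { return position; }
}
```

The parameter is called termEnum here because enum has been a reserved word since Java 5.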

Thanks again for your detailed review.  The fixes you suggest all sound good to me.

