lucene-java-user mailing list archives

From Michael McCandless
Subject Re: MergePolicy public but SegmentInfos package protected?
Date Fri, 27 Mar 2009 09:26:21 GMT
On Thu, Mar 26, 2009 at 9:51 PM, Marvin Humphrey wrote:

>> eg querying whether
>> compound file format is in use, whether separate norms are stored,
>> "get me total size in bytes of all files" (or maybe just "get me all
>> files", plus utility method somewhere to add up the sizes), so this
>> approach seems doable.
> Do you really need all that?  I think the crucial info is already available:
>  * The number of docs in each segment.
>  * The number of deletions in each segment, allowing you to calculate the
>    deletion percentage.

I'm just going with the info that the Log*MergePolicy classes use
today -- checking CFS and separate dels & norms is done for
"isOptimized". Oh, actually, IndexReader has an isOptimized(), which
we could simply use instead.
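
For reference, the deletion percentage mentioned above needs only the two
per-segment counts. A minimal sketch, assuming those counts are already in
hand; the class and method names are hypothetical and are not part of the
Lucene API:

```java
// Hypothetical helper for illustration only; not a Lucene class.
final class SegmentDeletionStats {
    private SegmentDeletionStats() {}

    /**
     * Returns the percentage (0 to 100) of a segment's documents that are
     * marked deleted. docCount is the total number of docs in the segment,
     * including deleted ones; deletedCount is how many of those are deleted.
     */
    static double deletionPercentage(int docCount, int deletedCount) {
        if (docCount < 0 || deletedCount < 0 || deletedCount > docCount) {
            throw new IllegalArgumentException(
                "invalid counts: docCount=" + docCount
                    + ", deletedCount=" + deletedCount);
        }
        if (docCount == 0) {
            return 0.0; // an empty segment has nothing to reclaim
        }
        return 100.0 * deletedCount / docCount;
    }
}
```

A merge policy could compare this value against a threshold of its own
choosing to decide whether a segment is worth merging just to reclaim
deleted docs.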

> I think it's reasonable to assume an average distribution of document sizes
> across segments.  Sure, that'll be wrong at the long tail of the curve, but
> most of the time it will be right -- and even when it's not, it won't cause
> big problems.

Yeah, this might be acceptable, though users who add a bunch of tiny
docs followed by a bunch of big docs (or vice versa) may see poor
merge choices.  Maybe in practice it wouldn't be a big deal.
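
To make the trade-off concrete, here is a rough sketch of what estimating a
segment's size from its doc count alone could look like, assuming every doc
in the index is about the same size. The class and method names are
hypothetical, and this is not how Log*MergePolicy computes sizes:

```java
// Hypothetical estimator for illustration only; not a Lucene class.
final class SegmentSizeEstimator {
    private SegmentSizeEstimator() {}

    /**
     * Estimates a segment's size in bytes as its proportional share of the
     * whole index, assuming all docs have roughly the same size.
     */
    static long estimateBytes(long totalIndexBytes, long totalDocs, long segmentDocs) {
        if (totalIndexBytes < 0 || totalDocs < 0
                || segmentDocs < 0 || segmentDocs > totalDocs) {
            throw new IllegalArgumentException("invalid arguments");
        }
        if (totalDocs == 0) {
            return 0L; // empty index: nothing to apportion
        }
        return Math.round((double) totalIndexBytes * segmentDocs / totalDocs);
    }
}
```

As a made-up example of the skewed case: take one segment with 900 docs of
about 100 bytes each (90,000 bytes) and another with 100 docs of about
10,000 bytes each (1,000,000 bytes). The index totals 1,090,000 bytes over
1,000 docs, so the estimate gives 981,000 bytes for the first segment and
109,000 for the second, roughly the reverse of their real sizes.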

>> But: we don't yet have IndexWriter holding open a reader for every
>> segment.  We are working on realtime search (LUCENE-1516), but even
>> then, if you don't ask for a realtime reader from IndexWriter, it
>> won't hold open SegmentReaders for all segments.
> Yeah, that's gonna be a bigger problem.  :(  It's cake to give Lucy's indexer
> a reader, because opening readers is cheap.  But the Lucene heavy-IndexReader
> model messes that up -- IndexWriter has traditionally been a fast class to
> open.

Right, this one seems like the deal breaker: IndexWriter should not in
general go and pool readers on all segments in the index.

