lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: MergePolicy public but SegmentInfos package protected?
Date Fri, 27 Mar 2009 15:09:09 GMT
On Fri, Mar 27, 2009 at 9:48 AM, Marvin Humphrey <> wrote:

> Every indexer opens a PolyReader [analogous to MultiSegmentReader], even when
> there's no data in the index, or a single segment. (I modified PolyReader so
> that 1-segment and 0-segment states were officially valid for this purpose.)


> PolyReader is a public class, as is SegReader, and PolyReader allows access to
> its subreaders via a Get_Seg_Readers() accessor.  Once we have SegReaders in
> hand, we can get at per-segment doc counts and deletion counts, which I think
> will be sufficient for planning a merge.

OK.  Whereas in Lucene neither MultiSegmentReader nor SegmentReader is public.

> As for the actual implementation of MergePolicy, I haven't prototyped that out
> yet.  Right now in KS, the infrastructure is reasonably primitive:
> IndexManager has a method called SegReaders_To_Merge() which accepts a
> PolyReader as an argument and returns an array of SegReaders representing
> content that should be merged.

KS does the fibonacci merge policy right?

>> Even though Lucy's SegmentReader is lighter weight, it still seems
>> like you shouldn't be opening them in the writer (except for realtime
>> search)?
> I don't see why not.

But it still ties up resources?  EG mmap uses up chunks of your
address space (possibly important on 32 bit machines, eg if you want a
large ram buffer in the writer), opening files takes time &
descriptors, etc.

> When I built this into KS, I thought I was imitating your plan for Lucene.  :)

I think for the time being we'll still allow "pure IndexWriter".

>> Are you going to simply make the segment metadata public?
> If you're referring to Segment's Fetch_Metadata method, it's public.  If it
> weren't, then plugin components couldn't use it, which would be unfortunate.
> I think we ought to make it easy for plugins to store their metadata as JSON
> inside segmeta.json file rather than resort to writing it in their own
> proprietary formats.


> In theory, any index component can peer into another's metadata, just by
> invoking Seg_Fetch_Metadata(segment, other_component_name).  That would be a
> bad idea, though, because what a component might choose to store there isn't
> public.

I think it's fine to not guard against it.  It's the same "we are all
consenting adults" approach that Python takes.

> I do expect that we will codify what data will be present in parts of
> segmeta.json as part of the official Lucy file format spec, however.  If you
> were both dumb and determined, you could duplicate all the version checking
> code and adhere to the spec, making it possible to (maybe) safely interpret
> that data.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message