lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4055) Refactor SegmentInfo / FieldInfo to make them extensible
Date Tue, 22 May 2012 17:25:41 GMT


Robert Muir commented on LUCENE-4055:

Just some updates from the work in the branch (scary changes, but proceeding nicely since Mike jumped in and did a lot of it).
Here's a list of the current progress:

* on disk, segments_N is reduced to the stuff that actually is per-commit: a list of segments, deletion gens/counts, etc.
* per-segment metadata (doc count, diagnostics, etc) that is write-once is encoded by the codec, e.g. for 4.0's codec this is in the .si file.
* removed backwards-seeking on segments_N, so AppendingCodec still works but doesn't need any special hacks.
* flush/merge order is changed so that fieldinfos are written last, so codecs have a chance to add metadata to them.
* fieldinfo has a "codec metadata" api that codec components can use, and that metadata will be available on reading the segment. This metadata is for the codec to use to extend fieldinfo; it's not carried along during merge or anything.
* PerFieldPostingsFormat is changed to use the fieldinfo metadata api rather than a separate .per file (e.g. it records that the "id" field uses Pulsing).
* all the hairiness involving files() is gone: we simply track which files were created and add them to the .si file. Previously there was a lot of logic to compute this in a symmetric way at both read and write time, and if you had a bug, your punishment was a FileNotFoundException (FNFE).
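
The file-tracking idea in the last bullet can be sketched roughly as follows. This is only an illustration, not the code in the branch: `FileTracker` and its methods are hypothetical names, and a plain set of strings stands in for whatever the .si file actually records.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: remember every file created while writing a segment,
// so the list of files never has to be recomputed at read time.
final class FileTracker {
    private final Set<String> created = new LinkedHashSet<>();

    // Codec components call this instead of creating files directly.
    // (A real implementation would also open the output; omitted here.)
    void createOutput(String fileName) {
        created.add(fileName);
    }

    // The exact set of files written, in creation order; this is what
    // would be stored alongside the other per-segment metadata.
    Set<String> createdFiles() {
        return Collections.unmodifiableSet(new LinkedHashSet<>(created));
    }
}
```

Because the reader only consults the recorded set, there is no second code path that has to agree with the writer about which files exist.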

not yet done:
* add metadata api to segmentinfo too, so that codec components can record per-segment information that they care about.
* see if we can implement 3.x's shared doc stores support with the segmentinfo metadata api. This is tricky to do, and it's also tricky to keep addIndexes/IndexSplitter etc, which do sneaky things, working.
* see if we can implement 3.x normGen (separate norms) with segmentinfo metadata. While in 3.x Lucene this was actually per-commit, since 3.x support is read-only we can effectively treat it as per-segment this way.
* rename stuff so that we have clearer naming for some of these classes.

I'm also probably missing a few other things. In general I'm pretty happy with the "metadata" key-value attributes api versus subclassing.

I tried to make subclassing work, but it turned really ugly fast and made various codec components too tightly coupled, e.g. if someone wants to combine a CompressedStoredFields with a PerFieldPostingsFormat and SpecialTermVectors, what would the impls be? :)

So the overly simple Map<String,String> avoids these issues, and hey, it's just metadata after all, so I don't think anything more complex is really needed.
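
To make the key-value point concrete, here is a minimal sketch of two unrelated components attaching metadata to the same field without knowing about each other. All names here (`FieldMeta`, `PostingsComponent`, `StoredFieldsComponent`, and the key strings) are made up for illustration and are not the classes in the branch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a field's metadata; not Lucene's FieldInfo.
final class FieldMeta {
    final String name;
    private final Map<String, String> attributes = new HashMap<>();

    FieldMeta(String name) {
        this.name = name;
    }

    // Returns the previous value for the key, or null if there was none.
    String putAttribute(String key, String value) {
        return attributes.put(key, value);
    }

    String getAttribute(String key) {
        return attributes.get(key);
    }

    Map<String, String> attributes() {
        return Collections.unmodifiableMap(attributes);
    }
}

// Each component owns one key, prefixed with its own name to avoid clashes.
final class PostingsComponent {
    static final String KEY = "PostingsComponent.format";

    static void write(FieldMeta field, String format) {
        field.putAttribute(KEY, format);
    }

    static String read(FieldMeta field) {
        return field.getAttribute(KEY);
    }
}

final class StoredFieldsComponent {
    static final String KEY = "StoredFieldsComponent.mode";

    static void write(FieldMeta field, String mode) {
        field.putAttribute(KEY, mode);
    }

    static String read(FieldMeta field) {
        return field.getAttribute(KEY);
    }
}
```

With subclassing, combining these two would need a class that extends both components' field types; with the map, each one just reads back its own key.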

> Refactor SegmentInfo / FieldInfo to make them extensible
> --------------------------------------------------------
>                 Key: LUCENE-4055
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 4.0
> After LUCENE-4050 is done the resulting SegmentInfo / FieldInfo classes should be made abstract so that they can be extended by Codec-s.
