lucene-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text
Date Sun, 13 May 2012 14:14:49 GMT


Andrzej Bialecki  commented on LUCENE-4050:

After discussing this further with Robert, it looks like this is a (smaller) part of a larger
issue: SegmentInfo+FieldInfo should be made extensible, and the process of reading/writing
this information should be *completely codec-specific*. Let's make a separate issue for that.

And the smaller issue discussed here is to record only the information about a commit point
in a *completely codec-independent, versioned format*, whatever that format is. Let's call
it CommitInfo or whatever other name fits. This part would be written to a file that is separate
from the codec-dependent parts.

Regarding two-phase commit and checksums - one reason we have SegmentInfosWriter/Reader is
the AppendingCodec, because we couldn't make it work for append-only filesystems. However,
we could change the two-phase commit implementation to the following:

* write the data to the CommitInfo file
* write a marker indicating "end of data, checksum follows"
* finally, write the checksum

Then the reading code knows that:

* if the marker is missing then the file is invalid
* if the marker is present then the checksum must be present too
* and the checksum must be correct.

This implementation doesn't require seek back / overwrite so it's supported on any filesystem.
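
To make the append-only scheme above concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions, not the actual Lucene implementation: the class name CommitInfoSketch, the END_MARKER value, the use of CRC32 and the fixed 12-byte trailer are all hypothetical choices. The payload stands in for whatever the CommitInfo data turns out to be. The writer only ever appends (data, then marker, then checksum), and the reader finds the trailer by counting back from the end of the file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch of an append-only commit file: data, end marker, checksum.
final class CommitInfoSketch {
    // Assumed marker value meaning "end of data, checksum follows".
    static final int END_MARKER = 0xC0DEC0DE;
    // Trailer = 4-byte marker + 8-byte checksum.
    static final int TRAILER_LEN = 4 + 8;

    // Builds the file contents strictly front to back; nothing is rewritten.
    static byte[] write(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer out = ByteBuffer.allocate(payload.length + TRAILER_LEN);
        out.put(payload);            // 1. the data
        out.putInt(END_MARKER);      // 2. the marker
        out.putLong(crc.getValue()); // 3. the checksum
        return out.array();
    }

    // Returns the payload, or throws if the marker or checksum is missing or wrong.
    static byte[] read(byte[] file) throws IOException {
        if (file.length < TRAILER_LEN) {
            throw new IOException("file too short to contain marker and checksum");
        }
        int payloadLen = file.length - TRAILER_LEN;
        ByteBuffer trailer = ByteBuffer.wrap(file, payloadLen, TRAILER_LEN);
        if (trailer.getInt() != END_MARKER) {
            throw new IOException("end marker missing: incomplete or invalid file");
        }
        long expected = trailer.getLong();
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLen);
        if (crc.getValue() != expected) {
            throw new IOException("checksum mismatch");
        }
        return Arrays.copyOf(file, payloadLen);
    }
}
```

Because the trailer has a fixed size in this sketch, a file truncated anywhere inside the marker or checksum simply fails the marker test, so "marker present but checksum absent" collapses into the same error case. A real writer would stream the three parts to the output in order rather than build them in memory.
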
> Change SegmentInfos format to plain text
> ----------------------------------------
>                 Key: LUCENE-4050
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>             Fix For: 4.0
> I propose to change the format of SegmentInfos file (segments_NN) to use plain text instead
> of the current binary format.
> SegmentInfos file represents a commit point, and it also declares what codecs were used
> for writing each of the segments that the commit point consists of. However, this is a chicken
> and egg situation - in theory the format of this file is customizable via Codec.getSegmentInfosFormat,
> but in practice we have to first discover what is the codec implementation that wrote this
> file - so the SegmentCoreReaders assumes a certain fixed binary layout of a preamble of this
> file that contains the codec name... and then the file is read again, only this time using
> the right Codec.
> This is ugly. Instead I propose to use a simple plain text format, either line oriented
> properties or JSON, in such a way that newer versions could easily extend it, and which wouldn't
> require any special Codec to read and parse. Consequently we could remove SegmentInfosFormat
> altogether, and instead add SegmentInfoFormat (notice the singular) to Codec to read single
> per-segment SegmentInfo-s in a codec-specific way. E.g. for Lucene40 codec we could either
> add another file or we could extend the .fnm file (FieldInfos) to contain also this information.

> Then the plain text SegmentInfos would contain just the following information:
> * list of global files for this commit point (if any)
> * list of segments for this commit point, and their corresponding codec class names
> * user data map
