lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <>
Subject [jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems
Date Thu, 29 Apr 2010 21:06:54 GMT


Lance Norskog commented on LUCENE-2373:

bq. Lance: yes. The original use case I had in mind was HDFS (Hadoop File System) which already
implements on-the-fly checksums. If we go the way that Mike suggested, i.e. implementing a
separate codec, then this should be a simple addition. We could also perhaps structure this
as a codec wrapper so that this capability can be applied to other codecs too.

+1 for doing this in Lucene itself. Many large installations don't use HDFS to move shards around.
Also, the HDFS checksum only takes effect once the file has arrived at the HDFS portal: before
that point, errors can already creep in from local RAM, local hard disk, shared file systems
and network I/O. Computing the checksum at the origin is more useful.
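
To illustrate what "checksum at the origin" could look like, here is a minimal sketch in plain Java. It is not Lucene code and not the codec wrapper discussed above: the class name OriginChecksumSketch, the method names, and the choice of an 8-byte CRC32 footer are all assumptions made for the example. The checksum is accumulated while the bytes are written, then appended at the end, so no seek back into the file is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

// Hypothetical illustration only; names and footer layout are assumptions.
class OriginChecksumSketch {

    /** Writes the payload, then appends the CRC32 of the payload as an 8-byte footer. */
    static byte[] writeWithChecksum(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            // The checksum is updated as bytes pass through, i.e. at the origin.
            CheckedOutputStream checked = new CheckedOutputStream(sink, new CRC32());
            checked.write(payload);
            checked.flush();
            long crc = checked.getChecksum().getValue();
            // The footer is written straight to the sink so it is not part of the checksum.
            new DataOutputStream(sink).writeLong(crc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Recomputes the CRC32 over everything but the footer and compares it with the stored value. */
    static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false; // too short to contain a footer
        }
        int payloadLength = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long stored = ByteBuffer.wrap(file, payloadLength, 8).getLong();
        return stored == crc.getValue();
    }
}
```

Any reader of the file, on HDFS or elsewhere, can call verify() after a transfer; a bit flipped anywhere between the writer and the reader makes it return false.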

> Change StandardTermsDictWriter to work with streaming and append-only filesystems
> ---------------------------------------------------------------------------------
>                 Key: LUCENE-2373
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
> Since early 2.x times Lucene used a skip/seek/write trick to patch the length of the
terms dict into a place near the start of the output data file. This however made it impossible
to use Lucene with append-only filesystems such as HDFS.
> In the post-flex trunk the following code in StandardTermsDictWriter initiates this:
> {code}
>     // Count indexed fields up front
>     CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
>     out.writeLong(0);                             // leave space for end index pointer
> {code}
> and completes this in close():
> {code}
>       out.writeLong(dirStart);
> {code}
> I propose to change this layout so that this pointer is stored simply at the end of the
file. It's always 8 bytes long, and we know the final length of the file from Directory,
so it's a single additional seek(length - 8) to read it, which is not much considering the
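
The proposed layout can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the actual StandardTermsDictWriter change: TrailingPointerSketch, write() and readDirStart() are hypothetical names, and an in-memory byte array stands in for the file. The writer only ever appends, and the reader finds the pointer at length - 8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Lucene's writer or file format.
class TrailingPointerSketch {

    /** Append-only write: terms data, then the directory, then the 8-byte pointer last. */
    static byte[] write(byte[] termsData, byte[] directory) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        try {
            out.write(termsData);
            long dirStart = out.size(); // offset at which the directory begins
            out.write(directory);
            out.writeLong(dirStart);    // trailer: always the final 8 bytes
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads the pointer from the last 8 bytes, the equivalent of seek(length - 8). */
    static long readDirStart(byte[] file) {
        return ByteBuffer.wrap(file).getLong(file.length - 8);
    }
}
```

Because nothing is patched after the fact, the same write sequence works on a stream or an append-only filesystem; the cost moves to the reader, which needs the file length before it can locate the directory.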

