lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems
Date Tue, 27 Apr 2010 03:29:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861214#action_12861214
] 

Lance Norskog commented on LUCENE-2373:
---------------------------------------

Does this make it possible to add a good checksum? 

The Cloud and NRT architectures involve copying lots of segment files around, and disk&RAM&network
bandwidth all have error rates. It would be great if the process of making an index file included,
on the fly, the creation of a solid checksum that is then baked into the file at the last
moment. It should also be in the segments.gen file, but it is more important that the file
should have the checksum embedded such that walking the whole file gives a fixed value.

> Change StandardTermsDictWriter to work with streaming and append-only filesystems
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-2373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2373
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>
> Since early 2.x times Lucene used a skip/seek/write trick to patch the length of the
terms dict into a place near the start of the output data file. This however made it impossible
to use Lucene with append-only filesystems such as HDFS.
> In the post-flex trunk the following code in StandardTermsDictWriter initiates this:
> {code}
>     // Count indexed fields up front
>     CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
>     out.writeLong(0);                             // leave space for end index pointer
> {code}
> and completes this in close():
> {code}
>       out.seek(CodecUtil.headerLength(CODEC_NAME));
>       out.writeLong(dirStart);
> {code}
> I propose to change this layout so that this pointer is stored simply at the end of the
file. It's always 8 bytes long, and we known the final length of the file from Directory,
so it's a single additional seek(length - 8) to read it, which is not much considering the
benefits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message