hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Fri, 25 Apr 2008 21:41:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592515#action_12592515 ]

Doug Cutting commented on HADOOP-3315:

[Meta comment: I wish folks would just describe problems in an issue's description, and leave
solutions to the comments.  Descriptions are appended to every email message.  Also, solutions
change as a result of discussion, while the problem should not.]

Is this a format just for compressed sequence files, or for all sequence files?

Is this intended as a replacement for MapFile too?

I think some kind of magic number header at the start of files is good to have.  That would
also permit back-compatibility with SequenceFile in this case.
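
A minimal sketch of how a reader could use such a header to stay compatible with
existing files. The "SEQ" prefix is what SequenceFile already writes at the start of
a file; the new format's magic (NEW_FORMAT_MAGIC) and the function name are
hypothetical, chosen only for illustration.

```python
import io

SEQUENCEFILE_MAGIC = b"SEQ"   # prefix written by existing SequenceFiles
NEW_FORMAT_MAGIC = b"NBF1"    # hypothetical magic for the proposed format


def detect_format(stream):
    """Pick a reader based on the leading bytes of a seekable stream."""
    head = stream.read(4)
    stream.seek(0)  # rewind so the chosen reader sees the file from the start
    if head.startswith(NEW_FORMAT_MAGIC):
        return "new-format"
    if head.startswith(SEQUENCEFILE_MAGIC):
        return "sequencefile"
    return "unknown"
```

A single open call could then dispatch on the result, so old SequenceFiles keep
working while new files take the new code path.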

In the index, what is "first record idx" -- is that the key or the ordinal position of the
first entry?
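
To make the two readings concrete, here is a rough sketch of a block lookup under
each one. The index entries, offsets, and function names are invented for
illustration; the key-based variant additionally assumes the keys are sorted,
which the description does not state.

```python
from bisect import bisect_right

# Hypothetical index entries: (block offset, first record ordinal, first key)
INDEX = [
    (0, 0, b"apple"),
    (4096, 100, b"mango"),
    (8192, 250, b"peach"),
]


def block_for_ordinal(index, n):
    """Reading 1: the idx is the ordinal (n >= 0) of the block's first record."""
    firsts = [entry[1] for entry in index]
    i = bisect_right(firsts, n) - 1  # last block starting at or before record n
    return index[i][0]


def block_for_key(index, key):
    """Reading 2: the idx is the block's first key (requires sorted keys)."""
    firsts = [entry[2] for entry in index]
    i = max(bisect_right(firsts, key) - 1, 0)  # clamp keys before the first block
    return index[i][0]
```

The ordinal reading supports "seek to record n" for any file; the key reading
supports MapFile-style lookups but only makes sense for sorted data.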

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. I would propose that we move to:
> {code}
> block 1
> block 2
> ...
> index
> tail
> {code}
> where each block is compressed and contains:
> {code}
> key/value 1: key len (vint), key, value len (vint), value
> key/value 2
> ...
> {code}
> The index would be compressed and contain:
> {code}
> block 1: offset, first record idx
> block 2: offset, first record idx
> block 3: offset, first record idx
> ...
> {code}
> and the tail would look like:
> {code}
> key class name
> value class name
> index kind (none, keys, keys+bloom filter)
> format version
> offset of tail
> offset of index
> {code}
> Then extensions of this format would put more indexes between the last block and the
> start of the index. So for example, the first key of each block:
> {code}
> first key of block 1: key len (vint), key
> first key of block 2
> ...
> offset of start key index
> {code}
> Another reasonable extension of the key index would be a bloom filter of the keys:
> {code}
> bloom filter serialization
> offset of bloom filter index start
> {code}
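
The record layout in the quoted description (key len, key, value len, value) can be
sketched as below. This is only an illustration: the vint here is a LEB128-style
varint, which is an assumption and not necessarily the encoding Hadoop would use,
and compression of the resulting block bytes is left out.

```python
import io


def write_vint(out, n):
    """Write a non-negative int as a LEB128-style varint (assumed encoding)."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_vint(inp):
    """Read one varint written by write_vint."""
    result = 0
    shift = 0
    while True:
        b = inp.read(1)
        if not b:
            raise EOFError("truncated vint")
        result |= (b[0] & 0x7F) << shift
        if (b[0] & 0x80) == 0:
            return result
        shift += 7


def encode_block(records):
    """Serialize (key, value) byte pairs as: key len, key, value len, value."""
    out = io.BytesIO()
    for key, value in records:
        write_vint(out, len(key))
        out.write(key)
        write_vint(out, len(value))
        out.write(value)
    return out.getvalue()


def decode_block(data):
    """Inverse of encode_block for an uncompressed block."""
    inp = io.BytesIO(data)
    records = []
    while inp.tell() < len(data):
        key = inp.read(read_vint(inp))
        value = inp.read(read_vint(inp))
        records.append((key, value))
    return records
```

In the proposed layout the bytes returned by encode_block would be handed to a
single codec per block, and the block's starting offset recorded in the index.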

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
