hadoop-common-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Sat, 26 Apr 2008 00:19:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592534#action_12592534 ]

Andrzej Bialecki  commented on HADOOP-3315:
-------------------------------------------

SequenceFiles may currently contain Metadata. I have never found a use for it in my
applications, because the Metadata has to be written out right after the file is opened and
cannot be updated later. A much better model for my needs would be to write the Metadata
record right before closing the file (in the proposed "tail" section), so that it can be
updated until the end, e.g. with the record count.
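
To illustrate the idea, here is a minimal Python sketch of a writer that keeps Metadata in memory and only serializes it when the file is closed. Everything in it is an assumption made for illustration: the class name `TailMetadataWriter`, the `record.count` key, the fixed 4-byte length prefixes, and the trailing 8-byte tail offset are hypothetical and are not part of SequenceFile or of the proposed format.

```python
import io
import struct


class TailMetadataWriter:
    """Hypothetical writer: records first, metadata written at close."""

    def __init__(self, stream):
        self.stream = stream
        self.metadata = {}      # may be updated until close() is called
        self.record_count = 0

    def append(self, key: bytes, value: bytes) -> None:
        # Fixed 4-byte lengths keep the sketch short; the proposal uses vints.
        self.stream.write(struct.pack(">I", len(key)) + key)
        self.stream.write(struct.pack(">I", len(value)) + value)
        self.record_count += 1

    def close(self) -> None:
        # The record count is only known now, which is the point of the tail.
        self.metadata["record.count"] = str(self.record_count)
        tail_offset = self.stream.tell()
        self.stream.write(struct.pack(">I", len(self.metadata)))
        for name, text in sorted(self.metadata.items()):
            for field in (name.encode("utf-8"), text.encode("utf-8")):
                self.stream.write(struct.pack(">I", len(field)) + field)
        # Last 8 bytes of the file: where the tail starts.
        self.stream.write(struct.pack(">Q", tail_offset))


def read_tail_metadata(data: bytes) -> dict:
    """Find the tail through the final 8 bytes and decode its entries."""
    (tail_offset,) = struct.unpack(">Q", data[-8:])
    pos = tail_offset
    (entries,) = struct.unpack_from(">I", data, pos)
    pos += 4
    metadata = {}
    for _ in range(entries):
        fields = []
        for _ in range(2):
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            fields.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        metadata[fields[0]] = fields[1]
    return metadata


buf = io.BytesIO()
writer = TailMetadataWriter(buf)
for i in range(3):
    writer.append(b"key%d" % i, b"value%d" % i)
writer.close()
print(read_tail_metadata(buf.getvalue()))
```

A reader only needs the last 8 bytes to locate the Metadata, so nothing has to be reserved or rewritten at the start of the file.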

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. I would propose that we move to:
> {code}
> block 1
> block 2
> ...
> index
> tail
> {code}
> where each block is compressed and contains:
> {code}
> key/value 1: key len (vint), key, value len (vint), value
> key/value 2
> ...
> {code}
> The index would be compressed and contain:
> {code}
> block 1: offset, first record idx
> block 2: offset, first record idx
> block 3: offset, first record idx
> ...
> {code}
> and the tail would look like:
> {code}
> key class name
> value class name
> index kind (none, keys, keys+bloom filter)
> format version
> offset of tail
> offset of index
> {code}
> Extensions of this format would then put additional indexes between the last block and the
> start of the index. For example, an index of the first key of each block:
> {code}
> first key of block 1: key len (vint), key
> first key of block 2
> ...
> offset of start key index
> {code}
> Another reasonable extension of the key index would be a bloom filter of the keys:
> {code}
> bloom filter serialization
> offset of bloom filter index start
> {code}
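
The base layout quoted above (blocks, index, tail) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the format that HADOOP-3315 will define: the vint is approximated with an unsigned LEB128 varint (not Hadoop's own vint encoding), zlib stands in for the codec, the two offsets that end the tail are written as fixed 8-byte integers so a reader can find them from the end of the file, and the function names and the class name string `"bytes"` are hypothetical.

```python
import bisect
import io
import struct
import zlib


def write_varint(out, n):
    """Unsigned LEB128 varint; a stand-in for the proposal's vint."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return


def read_varint(buf, pos):
    """Return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_string(out, text):
    raw = text.encode("utf-8")
    write_varint(out, len(raw))
    out.write(raw)


def write_file(out, records, records_per_block=2):
    """Write compressed blocks, then the compressed index, then the tail."""
    index = []  # (block offset, index of the first record in the block)
    for start in range(0, len(records), records_per_block):
        block = io.BytesIO()
        for key, value in records[start:start + records_per_block]:
            write_varint(block, len(key))
            block.write(key)
            write_varint(block, len(value))
            block.write(value)
        index.append((out.tell(), start))
        out.write(zlib.compress(block.getvalue()))
    index_offset = out.tell()
    raw_index = io.BytesIO()
    for offset, first in index:
        write_varint(raw_index, offset)
        write_varint(raw_index, first)
    out.write(zlib.compress(raw_index.getvalue()))
    tail_offset = out.tell()
    write_string(out, "bytes")  # key class name (placeholder)
    write_string(out, "bytes")  # value class name (placeholder)
    write_string(out, "none")   # index kind
    write_varint(out, 1)        # format version
    out.write(struct.pack(">QQ", tail_offset, index_offset))


def read_record(data, n):
    """Fetch record n by decompressing the index and a single block."""
    tail_offset, index_offset = struct.unpack(">QQ", data[-16:])
    raw_index = zlib.decompress(data[index_offset:tail_offset])
    offsets, firsts, pos = [], [], 0
    while pos < len(raw_index):
        offset, pos = read_varint(raw_index, pos)
        first, pos = read_varint(raw_index, pos)
        offsets.append(offset)
        firsts.append(first)
    i = bisect.bisect_right(firsts, n) - 1  # last block starting at or before n
    end = offsets[i + 1] if i + 1 < len(offsets) else index_offset
    block = zlib.decompress(data[offsets[i]:end])
    pos = 0
    for _ in range(n - firsts[i] + 1):  # skip forward inside the block
        key_len, pos = read_varint(block, pos)
        key = block[pos:pos + key_len]
        pos += key_len
        value_len, pos = read_varint(block, pos)
        value = block[pos:pos + value_len]
        pos += value_len
    return key, value
```

For example, after `write_file(out, records)` on an `io.BytesIO`, `read_record(out.getvalue(), 3)` returns the fourth key/value pair. The sketch does no bounds checking, so asking for a record past the end of the file raises an error instead of returning a clean result.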

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

