hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Sat, 26 Apr 2008 16:32:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592584#action_12592584 ]

Owen O'Malley commented on HADOOP-3315:
---------------------------------------

{quote}
Is this a format just for compressed sequence files, or for all sequence files?
{quote}

The issue is most critical for compressed sequence files, but it would make sense to make
the compression optional. I would not support value compression.

{quote}
Is this intended as a replacement for MapFile too?
{quote}

Yes.

{quote}
I think some kind of a magic number header at the start of files is good to have. That would
also permit back-compatibility with SequenceFile in this case.
{quote}

That probably makes sense, although my desire is for applications to *not* read the header
and to read only the tail, which holds the metadata, including the index.

{quote}
In the index, what is "first record idx" - is that the key or the ordinal position of the
first entry?
{quote}

It is the row id of the first row of that block, so that you can support seeking to a given
row number. That is useful if you have a bunch of files that correspond to different columns
in a big table: you would make splits that look like rows 1000-2000, and you can map each
split across multiple files.
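
As a rough sketch of how a row-range split could be mapped onto blocks, assume each file's index is available as a sorted list of "first record idx" values, one per block. The function names and the sample index values below are hypothetical and not part of the proposal; the lookup itself is an ordinary binary search.

```python
from bisect import bisect_right

def find_block(first_rows, row):
    """Return the position of the block that contains `row`.

    `first_rows` is the sorted list of "first record idx" values, one per
    block. The index alone does not record where the last block ends, so a
    row past the end of the file still maps to the last block here.
    """
    if not first_rows or row < first_rows[0]:
        raise ValueError("row precedes the first block")
    # The containing block is the last one whose first row is <= row.
    return bisect_right(first_rows, row) - 1

def blocks_for_split(first_rows, start_row, end_row):
    """Blocks overlapping the half-open row range [start_row, end_row)."""
    first = find_block(first_rows, start_row)
    last = find_block(first_rows, end_row - 1)
    return range(first, last + 1)

# Hypothetical indexes for two column files of the same table.
# The files use different block boundaries but share row numbering.
col_a = [0, 500, 1200, 1900, 2600]
col_b = [0, 900, 1800, 2700]

split = (1000, 2000)
blocks_a = list(blocks_for_split(col_a, *split))
blocks_b = list(blocks_for_split(col_b, *split))
```

Because both files are addressed by row number, the same split selects a different set of blocks in each file, and a reader would skip rows inside the first and last block that fall outside the range.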

The file layout would be:

{code}
magic (4 bytes)
block 0 .. n
index
tail
{code}

and the tail looks like:

{code}
key class name
value class name
index kind (none, keys, keys+bloom filter)
compression kind (none, default, lzo)
format version
offset of tail
offset of index
{code}
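
To illustrate reading only the tail, here is a minimal sketch that writes a file in the order above and then recovers the metadata without touching the header. The byte-level encoding is entirely an assumption made for the example: the magic value, 2-byte length-prefixed UTF-8 strings, a 4-byte version, and two 8-byte big-endian offsets as the final 16 bytes. The comment above does not specify any of these.

```python
import io
import struct

MAGIC = b"HBF0"                 # hypothetical 4-byte magic
TRAILER = struct.Struct(">qq")  # assumed: offset of tail, offset of index
STRING_FIELDS = ("key_class", "value_class", "index_kind", "compression_kind")

def _put_str(out, text):
    data = text.encode("utf-8")
    out.write(struct.pack(">H", len(data)))  # assumed 2-byte length prefix
    out.write(data)

def _get_str(src):
    (length,) = struct.unpack(">H", src.read(2))
    return src.read(length).decode("utf-8")

def write_file(out, blocks, index_bytes, meta):
    """Write magic, blocks, index, then the tail, in that order."""
    out.write(MAGIC)
    for block in blocks:
        out.write(block)
    index_offset = out.tell()
    out.write(index_bytes)
    tail_offset = out.tell()
    for name in STRING_FIELDS:
        _put_str(out, meta[name])
    out.write(struct.pack(">i", meta["format_version"]))
    # The two offsets come last so a reader can find them from the end.
    out.write(TRAILER.pack(tail_offset, index_offset))

def read_tail(src):
    """Recover the metadata by seeking from the end of the file."""
    src.seek(-TRAILER.size, io.SEEK_END)
    tail_offset, index_offset = TRAILER.unpack(src.read(TRAILER.size))
    src.seek(tail_offset)
    meta = {name: _get_str(src) for name in STRING_FIELDS}
    (meta["format_version"],) = struct.unpack(">i", src.read(4))
    meta["tail_offset"] = tail_offset
    meta["index_offset"] = index_offset
    return meta

buf = io.BytesIO()
write_file(
    buf,
    blocks=[b"block-0", b"block-1"],
    index_bytes=b"index",
    meta={
        "key_class": "org.apache.hadoop.io.Text",
        "value_class": "org.apache.hadoop.io.Text",
        "index_kind": "keys",
        "compression_kind": "lzo",
        "format_version": 1,
    },
)
tail = read_tail(buf)
```

Putting the fixed-size offsets at the very end is what lets a reader locate both the tail and the index with one seek relative to the end of the file, whatever the lengths of the class names.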

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. It would be good to have a file format that only needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

