hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Wed, 14 May 2008 04:39:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596621#action_12596621 ]

Owen O'Malley commented on HADOOP-3315:
---------------------------------------

That is my fault. Let's say that you have:

{code}
block size = 1 MB
key size = 100 bytes
file size = 5 GB
{code}

that means that you have:

{code}
key length vint = 1 byte
row idx vint = 4 bytes
block offset vint = 4 bytes
blocks = 5000
key index = 500 KB
row index = 40 KB
{code}

which makes it very cheap to read the row index if you don't need the key index (i.e., a single
64 KB read of the tail of the file will likely get the whole thing). The whole key index is
not huge, but it is much bigger than the row index and may be compressed, which makes it more
expensive to read.
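
For anyone who wants to redo the arithmetic with other parameters, here is a rough sketch in Python. It assumes decimal units (1 MB = 10^6 bytes), one key index entry per block (key plus its length vint), and one row index entry per block (row idx vint plus block offset vint). The function and parameter names are made up for illustration and are not part of any proposed API.

```python
# Back-of-the-envelope index sizing. Assumptions (not from the spec):
# decimal units, and exactly one key-index and one row-index entry per block.
KB = 10**3
MB = 10**6
GB = 10**9


def index_sizes(file_size, block_size, key_size,
                key_len_vint=1, row_idx_vint=4, block_offset_vint=4):
    """Return (blocks, key_index_bytes, row_index_bytes) for the given layout."""
    blocks = file_size // block_size
    # Each key index entry: the key itself plus a vint holding its length.
    key_index = blocks * (key_size + key_len_vint)
    # Each row index entry: a row idx vint plus a block offset vint.
    row_index = blocks * (row_idx_vint + block_offset_vint)
    return blocks, key_index, row_index


blocks, key_index, row_index = index_sizes(5 * GB, 1 * MB, 100)
print(blocks, key_index // KB, row_index // KB)
```

With the numbers above, the row index fits comfortably inside a single 64 KB tail read, while the key index is more than ten times larger.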

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Srikanth Kakani
>         Attachments: Tfile-1.pdf, TFile-2.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. It would be good to have a file format that only needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

