hadoop-common-dev mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Wed, 10 Sep 2008 04:23:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629699#action_12629699

Hong Tang commented on HADOOP-3315:

Just checked the code (CodedInputStream.java). The protocol buffer VInt (or VLong) encoding is pretty
interesting. It first transforms the integer through ZigZag encoding, which essentially turns
the long into leading 000...+[n]+[sign]. It then encodes the n+1 significant bits using ceiling((n+1)/7)
bytes (in little-endian style). So effectively, 1B can represent -64 to 63; 2B: -8K to 8K-1;
3B: -1M to 1M-1; 4B: -128M to 128M-1. Compared to my encoding scheme, I basically traded off
some 2B encoding space for expanded 1B coverage. Additionally, protocol buffer's decoding
requires you to read byte after byte, while both WritableUtils and my VLong can determine the
length of the whole encoding from the first byte.
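The ZigZag-then-varint scheme described above can be sketched in a few lines of Java. This is an illustrative reimplementation, not Hadoop or protobuf source; the class and method names are made up for the example.

```java
// Minimal sketch of ZigZag + little-endian base-128 varint encoding,
// as described in the comment above. Illustrative names, not protobuf's API.
public class ZigZagVarint {

    // ZigZag maps signed longs to unsigned values so that small
    // magnitudes (positive or negative) encode into few bytes:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63); // arithmetic shift smears the sign bit
    }

    // Emit the zigzagged value 7 bits at a time, least-significant group
    // first; the high bit of each byte flags "more bytes follow".
    static int encode(long n, byte[] out) {
        long v = zigZag(n);
        int i = 0;
        while ((v & ~0x7FL) != 0) {
            out[i++] = (byte) ((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out[i++] = (byte) v;
        return i; // number of bytes used
    }

    public static void main(String[] args) {
        byte[] buf = new byte[10];
        System.out.println(encode(63, buf));   // 1 byte (top of the 1B range)
        System.out.println(encode(-64, buf));  // 1 byte (bottom of the 1B range)
        System.out.println(encode(64, buf));   // 2 bytes
        System.out.println(encode(8191, buf)); // 2 bytes (8K-1)
        System.out.println(encode(8192, buf)); // 3 bytes
    }
}
```

Note how the byte-count boundaries fall exactly where the comment says: the decoder, however, must inspect every byte's continuation bit, whereas a WritableUtils-style length-prefixed scheme learns the total length from the first byte alone.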

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
TFile Specification Final.pdf
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
or decompress. It would be good to have a file format that only needs 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
