hadoop-common-dev mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Wed, 10 Sep 2008 03:19:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629690#action_12629690 ]

Hong Tang commented on HADOOP-3315:

    * Is Util#memcmp different from WritableComparator#compareBytes()?
[Hong] Looks like they are the same. My oversight.

    * Shouldn't BoundedByteArrayOutputStream extend ByteArrayOutputStream?
[Hong] No, it should not. ByteArrayOutputStream does not bound the number of bytes written to the
output stream. It automatically grows its internal buffer, which is not what we want. There is
little to share between the two except for the buffer and count definitions.
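
A minimal sketch of the bounded-buffer idea described above, assuming a fixed capacity chosen at construction time. The class name BoundedBufferSketch and its methods are hypothetical illustrations, not the class from the patch; the point is only that a write past the bound fails instead of growing the buffer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical illustration; not the BoundedByteArrayOutputStream from the patch.
final class BoundedBufferSketch extends OutputStream {
    private final byte[] buffer; // fixed size, never reallocated
    private int count;           // number of valid bytes in buffer

    BoundedBufferSketch(int capacity) {
        this.buffer = new byte[capacity];
    }

    @Override
    public void write(int b) throws IOException {
        if (count >= buffer.length) {
            throw new IOException("write exceeds bound of " + buffer.length + " bytes");
        }
        buffer[count++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        if (off < 0 || len < 0 || off > src.length - len) {
            throw new IndexOutOfBoundsException();
        }
        // Reject the whole write rather than growing, unlike ByteArrayOutputStream.
        if (len > buffer.length - count) {
            throw new IOException("write exceeds bound of " + buffer.length + " bytes");
        }
        System.arraycopy(src, off, buffer, count, len);
        count += len;
    }

    int size() {
        return count;
    }

    byte[] toByteArray() {
        return Arrays.copyOf(buffer, count);
    }
}
```

Extending OutputStream directly, rather than ByteArrayOutputStream, keeps the growth logic out of the class entirely.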

    * the VLong code duplicates code in WritableUtils, no?
[Hong] No. The new VLong format enlarges the range of integers that can be encoded in 2-4
bytes (at the expense of a reduced range of negative integers in the 1-byte case). The new format
can represent integers from -32 to 127 in 1B, -5120 to 5119 in 2B, -1M to 1M-1 in 3B,
and -128M to 128M-1 in 4B. Compare WritableUtils.VLong: 1B: -112 to 127, 2B: -256 to 255,
3B: -64K to 64K-1, 4B: -16M to 16M-1. This encoding scheme is more efficient for TFile, where
we may have lots of small integers but never small negative integers.
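
The ranges quoted above can be compared with a small lookup. This is only a sketch of the size comparison, not either encoder: the class and method names are hypothetical, K and M are assumed to mean 2^10 and 2^20, and values outside the quoted 1-4 byte ranges are reported as -1 because this message does not cover them.

```java
// Hypothetical helper that tabulates the ranges quoted in the comment above.
final class VLongRangeSketch {
    // {min, max} for encoded lengths 1..4 in the new TFile VLong format.
    static final long[][] TFILE = {
        {-32, 127},
        {-5120, 5119},
        {-(1L << 20), (1L << 20) - 1},   // -1M to 1M-1
        {-(1L << 27), (1L << 27) - 1},   // -128M to 128M-1
    };

    // {min, max} for encoded lengths 1..4 in WritableUtils.VLong, as quoted.
    static final long[][] WRITABLE_UTILS = {
        {-112, 127},
        {-256, 255},
        {-(1L << 16), (1L << 16) - 1},   // -64K to 64K-1
        {-(1L << 24), (1L << 24) - 1},   // -16M to 16M-1
    };

    // Returns the smallest quoted length (1..4) whose range contains value,
    // or -1 if the value falls outside all four quoted ranges.
    static int encodedLength(long value, long[][] ranges) {
        for (int i = 0; i < ranges.length; i++) {
            if (value >= ranges[i][0] && value <= ranges[i][1]) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Under these ranges, 1000 takes 2 bytes in the new format and 3 in WritableUtils, while -100 takes 2 bytes in the new format and 1 in WritableUtils, which matches the trade-off described above.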

    * readString/writeString duplicates Text methods.
[Hong] Sort of. But Text.readString and writeString use WritableUtils.VInt, so if we
were to use these methods directly, we would have to document WritableUtils's VInt/VLong encoding
as well, and defining two VInt/VLong standards in one spec would be confusing.

    * should the Compression enum be simply a new method on CompressionCodecFactory? If not,
shouldn't it go in the io.compress package?
[Hong] This part is a quick implementation of what should become a more extensible compression
algorithm management scheme in the future. We did not use CompressionCodecFactory directly
because CompressionCodecFactory.getCodec() expects a path and uses the suffix portion of the
path to find the codec, based on some configuration. Using it directly would break the
TFile spec's requirement for language/implementation neutrality. On the other hand, it may
be nice to include a standard string-name-to-compression-codec mapping in Hadoop.
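
A rough sketch of what a string-name-to-algorithm lookup could look like, as opposed to a lookup keyed on a file path suffix. Everything here is an assumption for illustration: the enum name, the constants, and the name strings "none", "gz", and "lzo" are hypothetical and are not taken from the patch or the spec.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of name-based algorithm lookup; not the patch's enum.
enum CompressionNameSketch {
    NONE("none"),
    GZ("gz"),
    LZO("lzo");

    private static final Map<String, CompressionNameSketch> BY_NAME = new HashMap<>();

    static {
        for (CompressionNameSketch algorithm : values()) {
            BY_NAME.put(algorithm.specName, algorithm);
        }
    }

    // The name that would be recorded in the file, independent of any
    // file path or implementation-specific configuration.
    private final String specName;

    CompressionNameSketch(String specName) {
        this.specName = specName;
    }

    String specName() {
        return specName;
    }

    static CompressionNameSketch fromName(String name) {
        CompressionNameSketch algorithm = BY_NAME.get(name);
        if (algorithm == null) {
            throw new IllegalArgumentException("unknown compression name: " + name);
        }
        return algorithm;
    }
}
```

Because the lookup depends only on a fixed string, a reader written in another language could resolve the same name without any Hadoop configuration.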

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
TFile Specification Final.pdf
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
or decompress. It would be good to have a file format that only needs 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
