hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Wed, 10 Sep 2008 19:01:49 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629916#action_12629916 ]

Doug Cutting commented on HADOOP-3315:

Looking at the specification document, I see the major stated goals are (1) language neutrality,
(2) extensibility, and (3) compatibility.  I assume these are relative to SequenceFile.

Language neutrality without an implementation in another language seems a risky claim.  SequenceFile's
only language dependence is in the naming of key and value classes, but implementations of
these classes are not required to process a SequenceFile.  SequenceFile, like TFile, lacks
implementations in other languages, so I don't yet see a clear advantage there.

(2) and (3) are closely related.  SequenceFile has proven extensible and back-compatible.  Many
features have been added without breaking back-compatibility.  I don't see a qualitative advantage
here to the TFile format.

Perhaps you should include a section specifically addressing the advantages of TFile over
SequenceFile, how they are achieved and how they can be measured.

I suspect there may be other unstated goals in TFile.  The case for TFile should be clearly
made, as it adds a lot of code to Hadoop that must now be supported.  If it has demonstrable
advantages over SequenceFile and the case can be made that we will be able to retire SequenceFile
after it is added, then TFile should go forward.  Or if it is significantly simpler than SequenceFile
while providing the same features, that might make the case that it will be easier to reimplement
in other languages.  But if it is equivalently complex and supports more-or-less the same
features then it only adds baggage to the project.

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
or decompress. It would be good to have a file format that only needs 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
