hadoop-common-dev mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3315) New binary file format
Date Wed, 28 May 2008 07:19:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600391#action_12600391 ]

Hong Tang commented on HADOOP-3315:

Sorry I am late to the party. (I just joined the Grid group on 05/12). 

A few questions/suggestions: 
- Do we assume a key-value pair always fits in one block?
- It seems that each TFile would contain one or two columns (one for keyless columns and two for
keyed columns). Might we ever need to put more than two columns in one TFile?
- It looks like TFile is not concerned with supporting sub-columns (as in BigTable's column
families). One downside is that the application must read a whole column-family value and deserialize
the internal sub-column key-value pairs. This may be acceptable if we assume a column-family cell
fits in a block, but may not be feasible if a cell contains thousands or even
more sub-columns and spans multiple blocks.
- What is vint? Is it some sort of variable-length integer encoding? AFAIK, byte-oriented
decoding is very slow (you may end up using only a small fraction of the available memory
bandwidth). An alternative scheme could be to use special characters to separate the
keys and values, and escape any occurrence of those special characters in keys and
values when we write the file. If the key and value sizes are small, this may give us better
performance. [Needs verification with experimentation.]
- Should we put a size limit on the TFile (Owen, is your 5GB figure an average size or a worst
case)? It could simplify the design by letting us assume the index part fully fits in memory
for a TFile. Other than DFS having issues with too many files, what reasons might there be for
not restricting the size of a TFile? Do we simply offload that decision to the application side
(which must then pick the right partition factor)?
- I think we need to support embedding uninterpreted, application-specific metadata (application
here meaning the direct user of TFile) in a TFile. Without such support, we may end up having
to save that information in a separate file that travels alongside each TFile.
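For context on the vint question: "vint" generally refers to a variable-length integer encoding (Hadoop's WritableUtils does provide writeVInt/readVInt). A minimal base-128 sketch below is illustrative only, not Hadoop's exact on-disk VInt format; it shows the byte-at-a-time decode loop whose cost the comment questions:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical base-128 varint: 7 value bits per byte, high bit set means
// "more bytes follow". Small values (< 128) take a single byte.
public class VarInt {
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // emit low 7 bits, set continuation bit
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decoding must examine one byte at a time to find the terminator --
    // this per-record branching is the overhead the comment worries about
    // when keys and values are small.
    static int decode(byte[] buf) {
        int value = 0, shift = 0, i = 0;
        byte b;
        do {
            b = buf[i++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

For example, values up to 127 encode to one byte while 128 needs two, so the space savings are real, but every read pays the branch-per-byte cost.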
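The delimiter-plus-escaping alternative mentioned above can be sketched as follows. The SEP and ESC byte values here are arbitrary choices for illustration, not anything specified by TFile:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical delimiter framing: terminate each record with SEP, and
// escape any SEP/ESC bytes occurring inside the record with a leading ESC.
public class EscapeFrame {
    static final byte SEP = 0x00, ESC = 0x01; // illustrative values only

    static byte[] frame(byte[] record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : record) {
            if (b == SEP || b == ESC) out.write(ESC); // escape reserved bytes
            out.write(b);
        }
        out.write(SEP); // record terminator
        return out.toByteArray();
    }

    // Finding the record boundary is a forward byte scan; when escapes are
    // rare this avoids the per-record length-decoding of a vint scheme.
    static byte[] unframe(byte[] buf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; buf[i] != SEP; i++) {
            if (buf[i] == ESC) i++; // skip escape, take the next byte literally
            out.write(buf[i]);
        }
        return out.toByteArray();
    }
}
```

The trade-off matches the comment's caveat: scanning is cheap when reserved bytes are rare, but worst-case data full of SEP/ESC bytes nearly doubles in size, so it would indeed need experimental verification.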

> New binary file format
> ----------------------
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Srikanth Kakani
>         Attachments: Tfile-1.pdf, TFile-2.pdf
> SequenceFile's block compression format is too complex and requires 4 codecs to compress
> or decompress. It would be good to have a file format that only needs 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
