Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Message-ID: <20841992.1101271381845967.JavaMail.jira@thor>
Date: Thu, 15 Apr 2010 21:37:25 -0400 (EDT)
From: "Hong Tang (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-6708) New file format for very large
 records
In-Reply-To: <23756292.153431271370052193.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857639#action_12857639 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. What's the relationship between "blocks" and "chunks" in a TFile?
A TFile contains zero or more compressed blocks. Each block contains a sequence of key-value pairs (key, value, key, value, ...). Each value consists of one or more chunks. A block has a minimum size of 256KB. Whenever the accumulated data exceeds the minimum block size, we "close" the current block and start a new one. The offsets and lengths of all blocks are recorded in an index section.
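
To make the block-closing rule concrete, here is a rough sketch of it. It is not the actual TFile code: the class and method names are hypothetical, compression is ignored (so the recorded length is just the raw byte count), and closing at exactly the minimum size is my assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the block-closing rule; none of these names come from TFile.
final class BlockingWriterSketch {
    // One index entry: where a closed block starts and how long it is.
    static final class BlockEntry {
        final long offset;
        final long length;
        BlockEntry(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final long minBlockSize;
    private final List<BlockEntry> index = new ArrayList<>();
    private long blockStart = 0; // file offset where the open block begins
    private long blockBytes = 0; // bytes accumulated in the open block

    BlockingWriterSketch(long minBlockSize) {
        this.minBlockSize = minBlockSize;
    }

    // Appends one whole record (key plus value). A record never spans blocks.
    void append(long recordBytes) {
        blockBytes += recordBytes;
        // Assumption: the block closes once it reaches the minimum size.
        if (blockBytes >= minBlockSize) {
            closeBlock();
        }
    }

    // Flushes a trailing block that never reached the minimum size.
    void close() {
        if (blockBytes > 0) {
            closeBlock();
        }
    }

    List<BlockEntry> index() {
        return index;
    }

    private void closeBlock() {
        index.add(new BlockEntry(blockStart, blockBytes));
        blockStart += blockBytes;
        blockBytes = 0;
    }
}
```

Because the size check runs after every append, a record much larger than the minimum block size always ends up as the last record of its block.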

bq. Is a record fully contained in a block?
Yes.

bq.  If it compresses an 8 GB record down to, say, 2 GB, will that still require skipping chunk-wise through the compressed data?
No, because it would be the last record in that block. With my suggested optimization, it would be an O(1) operation to skip that record.

bq. Also how does TFile handle splits and resynchronizing? It doesn't seem like there's an InputFormat for it. 
Writing an input format for it is pretty easy; I believe Owen has a prototype of OFile on top of TFile on his laptop. :) Generally, you would extend FileInputFormat, and your record reader would be backed by a TFile.Reader.Scanner created by TFile.Reader.createScannerByByteRange(long offset, long length). Internally, this method moves the byte range to the boundaries of TFile compression blocks (using the block index it maintains).
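
The boundary alignment can be pictured roughly as follows. This is only a sketch under my own assumptions, not what TFile does internally: the names are hypothetical, the block index is reduced to a sorted array of block start offsets, and a block is assigned to whichever byte range contains its start offset.

```java
// Hypothetical sketch of snapping a byte range to block boundaries.
final class SplitAlignSketch {
    // Index of the first block whose start offset is >= pos.
    // blockOffsets must be sorted in ascending order.
    static int lowerBound(long[] blockOffsets, long pos) {
        int lo = 0;
        int hi = blockOffsets.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (blockOffsets[mid] < pos) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns {firstBlock, endBlock} (end exclusive) for the byte range
    // [offset, offset + length). Assumed rule: a block belongs to the
    // range that contains its start offset.
    static int[] blocksForRange(long[] blockOffsets, long offset, long length) {
        int first = lowerBound(blockOffsets, offset);
        int end = lowerBound(blockOffsets, offset + length);
        return new int[] { first, end };
    }
}
```

With this rule, adjacent splits that tile the file cover every block exactly once, so no record is read twice or dropped, and a split that falls entirely inside one block gets an empty range.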

> New file format for very large records
> --------------------------------------
>
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
>
>
> A file format that handles multi-gigabyte records efficiently, with lazy disk access
