From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6708) New file format for very large records
Date Fri, 16 Apr 2010 01:37:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857639#action_12857639 ]

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. What's the relationship between "blocks" and "chunks" in a TFile?
A TFile contains zero or more compressed blocks. Each block contains a sequence of key-value
pairs. Each value can consist of one or more chunks. A block has a minimum size of 256KB:
whenever the accumulated data exceeds that minimum, we "close" the current block and start
a new one. All blocks have their offsets and lengths recorded in an index section.
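
For illustration, here is a minimal sketch of writing such a file with the TFile.Writer API
from org.apache.hadoop.io.file.tfile. The path and sizes are made up; the prepareAppendValue(-1)
call is the path that produces multi-chunk values, since the writer spills the value into
chunks when its length is not known up front.

{code:java}
import java.io.DataOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

public class TFileWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/example.tfile"));
    // 256KB minimum block size, gzip compression, memcmp key ordering.
    TFile.Writer writer = new TFile.Writer(out, 256 * 1024,
        TFile.COMPRESSION_GZ, TFile.COMPARATOR_MEMCMP, conf);
    // Small value: fits in a single chunk.
    writer.append("key-0".getBytes(), "small value".getBytes());
    // Large value of unknown length: passing -1 makes the writer
    // break the value into multiple chunks as bytes arrive.
    DataOutputStream kOut = writer.prepareAppendKey(5);
    kOut.write("key-1".getBytes());
    kOut.close();
    DataOutputStream vOut = writer.prepareAppendValue(-1);
    for (int i = 0; i < 1024; i++) {
      vOut.write(new byte[64 * 1024]); // 64MB total, chunked on disk
    }
    vOut.close();
    writer.close();
    out.close();
  }
}
{code}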

bq. Is a record fully contained in a block?
Yes.

bq.  If it compresses an 8 GB record down to, say, 2 GB, will that still require skipping
chunk-wise through the compressed data?
No. A block is closed as soon as the accumulated data exceeds the minimum block size, so
such a record would necessarily be the last record in its block. With my suggested optimization,
skipping that record would be an O(1) operation.
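
To make the O(1) claim concrete, here is a sketch of the idea. It is an assumption about how
the optimization would work (the helper below is hypothetical, not TFile API), based only on
the fact that the index records every block's offset and length.

{code:java}
// Hypothetical helper, not part of TFile: illustrates why the skip is O(1).
// blockOffsets comes from the block index; curBlock holds the record to
// skip, which is by construction the last record in that block.
long skipToNextBlock(long[] blockOffsets, int curBlock) {
  // One index lookup and one seek; no decompression, no chunk walking.
  return blockOffsets[curBlock + 1];
}
{code}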

bq. Also how does TFile handle splits and resynchronizing? It doesn't seem like there's an
InputFormat for it. 
Writing an input format for it is pretty easy; I believe Owen has a prototype of OFile on
top of TFile on his laptop. :) Generally, you would extend FileInputFormat, and your record
reader would be backed by a TFile.Reader.Scanner created by TFile.Reader.createScannerByByteRange(long
offset, long length). Internally, this method aligns the byte range to the boundaries of
TFile compression blocks (through the block index it maintains), which is how adjacent splits
resynchronize without overlapping.
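
A sketch of what such an input format could look like follows, under the new mapreduce API.
This is an illustration, not Owen's prototype; everything other than the TFile calls is made
up here, and it materializes each value in memory, which you would not do for the
multi-gigabyte records this issue targets.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.file.tfile.TFile;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TFileInputFormat
    extends FileInputFormat<BytesWritable, BytesWritable> {

  @Override
  public RecordReader<BytesWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new TFileRecordReader();
  }

  static class TFileRecordReader
      extends RecordReader<BytesWritable, BytesWritable> {
    private TFile.Reader reader;
    private TFile.Reader.Scanner scanner;
    private final BytesWritable key = new BytesWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(ctx.getConfiguration());
      FSDataInputStream in = fs.open(file);
      long fileLength = fs.getFileStatus(file).getLen();
      reader = new TFile.Reader(in, fileLength, ctx.getConfiguration());
      // The scanner snaps the byte range to compression-block boundaries,
      // so adjacent splits see disjoint sets of records.
      scanner = reader.createScannerByByteRange(split.getStart(),
                                                split.getLength());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (scanner.atEnd()) {
        return false;
      }
      TFile.Reader.Scanner.Entry entry = scanner.entry();
      key.setSize(entry.getKeyLength());
      entry.getKey(key.getBytes());
      // For very large values, stream via the entry instead of
      // materializing the whole value as done here.
      value.setSize(entry.getValueLength());
      entry.getValue(value.getBytes());
      scanner.advance();
      return true;
    }

    @Override
    public BytesWritable getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
      // Coarse progress: done or not done.
      return scanner.atEnd() ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
      scanner.close();
      reader.close();
    }
  }
}
{code}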

> New file format for very large records
> --------------------------------------
>
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
>
>
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
