From: "Hong Tang (JIRA)"
To: common-issues@hadoop.apache.org
Date: Thu, 15 Apr 2010 21:37:25 -0400 (EDT)
Subject: [jira] Commented: (HADOOP-6708) New file format for very large records
Message-ID: <20841992.1101271381845967.JavaMail.jira@thor>
In-Reply-To: <23756292.153431271370052193.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857639#action_12857639 ]

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. What's the relationship between "blocks" and "chunks" in a TFile?

A TFile contains zero or more compressed blocks. Each block contains a sequence of key, value, key, value pairs. Each value can consist of one or more chunks. A block has a minimum size of 256KB: whenever we accumulate enough data to exceed the minimum block size, we "close" the current block and start a new one. The offsets and lengths of all blocks are recorded in an index section.

bq. Is a record fully contained in a block?

Yes.

bq. If it compresses an 8 GB record down to, say, 2 GB, will that still require skipping chunk-wise through the compressed data?

No, because it would be the last record in that block. With my suggested optimization, it would be an O(1) operation to skip that record.

bq. Also how does TFile handle splits and resynchronizing? It doesn't seem like there's an InputFormat for it.

Writing an input format for it is pretty easy; I believe Owen has a prototype of OFile on top of TFile on his laptop. :) Generally, you would extend FileInputFormat, and your record reader would be backed by a TFile.Reader.Scanner created by TFile.Reader.createScannerByByteRange(long offset, long length). Internally, this method moves the byte range to the boundaries of TFile compression blocks (through the block index it maintains).

> New file format for very large records
> --------------------------------------
>
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
>
>
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
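
As a rough illustration of the block layout and byte-range alignment described in the comment above, here is a minimal Java sketch. It does not use the TFile API: the class BlockIndexSketch and its methods blockStartOffsets, firstBlockAtOrAfter and alignToBlocks are hypothetical names made up for this example. It also makes two assumptions that the comment does not state: compression is ignored when sizing blocks, and a block is assigned to the byte range that contains the block's start offset. That is one common convention and may not be exactly what createScannerByByteRange does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these names come from the TFile code base.
final class BlockIndexSketch {

    // Writer side: given record sizes in write order, return the start offset
    // of every block. A block is closed once it holds at least minBlockSize
    // bytes, so a record never straddles two blocks and an oversized record
    // is always the last record of its block. Compression is ignored here.
    static long[] blockStartOffsets(long[] recordSizes, long minBlockSize) {
        List<Long> starts = new ArrayList<>();
        long pos = 0;          // current write position in the file
        long inBlock = 0;      // bytes accumulated in the open block
        boolean open = false;  // whether a block is currently open
        for (long size : recordSizes) {
            if (!open) {
                starts.add(pos);   // record the new block in the index
                open = true;
            }
            pos += size;
            inBlock += size;
            if (inBlock >= minBlockSize) {
                open = false;      // "close" the block after this record
                inBlock = 0;
            }
        }
        return starts.stream().mapToLong(Long::longValue).toArray();
    }

    // Binary search over the block index: position of the first block whose
    // start offset is >= pos, or starts.length if there is no such block.
    static int firstBlockAtOrAfter(long[] starts, long pos) {
        int lo = 0;
        int hi = starts.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] < pos) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Reader side: snap the byte range [offset, offset + length) to block
    // boundaries. Returns {begin, end}, a half-open range of block numbers.
    // Assumed rule: a block belongs to the range containing its start offset.
    static int[] alignToBlocks(long[] starts, long offset, long length) {
        return new int[] {
            firstBlockAtOrAfter(starts, offset),
            firstBlockAtOrAfter(starts, offset + length)
        };
    }

    public static void main(String[] args) {
        long minBlockSize = 256L * 1024;  // 256KB, as in the comment
        long[] starts = blockStartOffsets(
            new long[] {200_000L, 8_000_000L, 50_000L}, minBlockSize);
        System.out.println("block starts: " + Arrays.toString(starts));
        System.out.println("range [0, 4000000): blocks "
            + Arrays.toString(alignToBlocks(starts, 0L, 4_000_000L)));
        System.out.println("range [4000000, 8000000): blocks "
            + Arrays.toString(alignToBlocks(starts, 4_000_000L, 4_000_000L)));
    }
}
```

Under these assumptions, the large record closes the block it was written into, so a reader can jump to the next block's offset from the index instead of walking the record chunk by chunk. Adjacent byte ranges never share a block, and a range that falls entirely inside one large block simply gets an empty block range.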