Message-ID: <32032177.155531271376650748.JavaMail.jira@thor>
Date: Thu, 15 Apr 2010 20:10:50 -0400 (EDT)
From: "Aaron Kimball (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-6708) New file format for very large records
In-Reply-To: <23756292.153431271370052193.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857609#action_12857609 ]

Aaron Kimball
commented on HADOOP-6708:
---------------------------------------

Hong,

bq. * Length fields are encoded as integers, not longs. This does not support records > 2 GB.
bq. This is an intentional restriction. All integers are in VInt/VLong format which is fully wire compatible. You can easily make a case to request such limit be lifted.

Does this mean that the TFile API could be changed, without complication, to accept and return {{long}} values? The TFile spec mentions the 2 GB value limit in several places, which suggests that other aspects of TFile may break because they assume integer-sized lengths.

bq. Even if you do not know the length of the record you write (namely specifying -1 during writing), you can still efficiently skip a record (even after partially consuming some bytes of the record). Isn't it sufficient for your case? Searching for a synchronization boundary is very inefficient than length-prefixed encoding.

Data comes to me from JDBC through an InputStream or a Reader whose length I do not know in advance. I read from that InputStream/Reader and write its contents to an OutputStream/Writer that dumps into a file (LobFile). When I have a character-based Reader, I know how many characters I have, which is a lower bound on the number of bytes but not the exact count. So my plan was to seek ahead by that many bytes, then search for the boundary. Assuming most characters are one byte, the search will be pretty quick.

How does TFile support length skipping if you don't pre-declare the lengths?
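To make the seek-ahead-then-search plan concrete, here is a minimal sketch. It is not the LobFile implementation: the class name, the 4-byte {{SYNC}} marker, and the in-memory byte array standing in for the file are all hypothetical. It assumes records are UTF-8 text, so the character count is a lower bound on the byte count, and it relies on the fact that valid UTF-8 never contains the byte 0xFF, so the marker cannot appear inside character data. Binary records would need a different marker scheme.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class BoundarySkipSketch {
    // Hypothetical record-boundary marker (assumption, not from the LobFile spec).
    // 0xFF never occurs in valid UTF-8, so it cannot show up inside text records.
    static final byte[] SYNC = {(byte) 0xFF, (byte) 0xFE, (byte) 0xFD, (byte) 0xFC};

    /** Writes each record as UTF-8 bytes followed by the boundary marker. */
    static byte[] writeRecords(String... records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            out.write(SYNC, 0, SYNC.length);
        }
        return out.toByteArray();
    }

    /**
     * Skips the record that starts at recordStart without decoding it.
     * charCount is the known number of chars in the record; every char
     * occupies at least one UTF-8 byte, so the marker cannot begin before
     * recordStart + charCount. Returns the offset just past the marker,
     * or -1 if no marker is found.
     */
    static int skipRecord(byte[] file, int recordStart, int charCount) {
        int from = recordStart + charCount; // seek ahead by the lower bound
        for (int i = from; i + SYNC.length <= file.length; i++) {
            if (matchesAt(file, i)) {
                return i + SYNC.length;
            }
        }
        return -1;
    }

    /** True if the marker bytes appear in file starting at offset pos. */
    private static boolean matchesAt(byte[] file, int pos) {
        for (int j = 0; j < SYNC.length; j++) {
            if (file[pos + j] != SYNC[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is 5 chars but 6 UTF-8 bytes, so a short scan is needed.
        byte[] file = writeRecords("h\u00e9llo", "world");
        int next = skipRecord(file, 0, 5);
        System.out.println("second record starts at byte " + next);
    }
}
```

The scan only covers the gap between the character count and the true byte length, so for mostly single-byte text it touches very few bytes before reaching the marker.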
> New file format for very large records
> --------------------------------------
>
>                 Key: HADOOP-6708
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: lobfile.pdf
>
>
> A file format that handles multi-gigabyte records efficiently, with lazy disk access

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira