Message-ID: <1923573096.1221002384687.JavaMail.jira@brutus>
Date: Tue, 9 Sep 2008 16:19:44 -0700 (PDT)
From: "Pete Wyckoff (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4065) support for reading binary data from flat files
In-Reply-To: <165315971.1220482544298.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629654#action_12629654 ]

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

Yes, good point. I will change it to DeserializerTypedFile.

But the SequenceFileRecordReader is reusable for all of these. From the point of view of the reader of a file that does its own deserializing of its types, it's all the same. With this interface, the SequenceFileRecordReader can read SequenceFiles, DeserializerTypedFiles (Thrift, Protocol Buffers, Record I/O, whatever) and any other self-describing typed files; SequenceFiles are just one example of these. Otherwise, I don't see how to avoid re-implementing the current SequenceFileRecordReader functionality for each of these use cases.

-- pete

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception (which is hard to distinguish from an exception due to corruption)). this is easy for non-compressed streams. for compressed streams, DecompressorStream has a useful-looking getAvailable() call, except the class is marked package private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
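
The EOF concern in the quoted issue description can be illustrated with a small sketch. This is plain Python, not Hadoop code, and everything in it is an assumption made for illustration: the names `read_records`, `deserialize_length_prefixed` and `encode` are hypothetical, and so is the 4-byte length-prefixed record layout. The reader loop peeks at the stream before calling the pluggable deserializer, so a clean end of file simply ends the loop, while a record cut off partway raises an error that can be treated as corruption. The same loop is used for a plain stream and a gzip-compressed one.

```python
import gzip
import io
import struct
from typing import BinaryIO, Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def read_records(stream: BinaryIO, deserialize: Callable[[BinaryIO], T]) -> Iterator[T]:
    """Yield records until the stream ends cleanly at a record boundary."""
    # BufferedReader adds peek(), which looks ahead without consuming bytes.
    buffered = io.BufferedReader(stream)
    # peek() returns empty bytes only at end of stream, so the deserializer
    # is never asked to start a record that is not there.
    while buffered.peek(1):
        yield deserialize(buffered)


def deserialize_length_prefixed(stream: BinaryIO) -> bytes:
    """Hypothetical record format: 4-byte big-endian length, then payload."""
    header = stream.read(4)
    if len(header) < 4:
        raise ValueError("truncated record header")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise ValueError("truncated record payload")
    return payload


def encode(records: Iterable[bytes]) -> bytes:
    """Serialize records in the same hypothetical length-prefixed layout."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


if __name__ == "__main__":
    flat = encode([b"alpha", b"beta"])

    # Uncompressed flat file.
    print(list(read_records(io.BytesIO(flat), deserialize_length_prefixed)))

    # Compressed flat file: only the stream changes, not the reader loop.
    compressed = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(flat)))
    print(list(read_records(compressed, deserialize_length_prefixed)))
```

With this split, a `ValueError` from the deserializer always means a damaged record, never a normal end of input, which is the distinction the issue asks for.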