Date: Thu, 18 Sep 2008 07:03:44 -0700 (PDT)
From: "Tom White (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4065) support for reading binary data from flat files
Message-ID: <1756242837.1221746624856.JavaMail.jira@brutus>
In-Reply-To: <165315971.1220482544298.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632217#action_12632217 ]

Tom White commented on HADOOP-4065:
-----------------------------------
A few comments:

Could the types be called FlatFileInputFormat and FlatFileRecordReader?

Is a SerializationContext class needed? The Serialization can be obtained from the SerializationFactory. The factory just needs to know the base class (Writable, TBase, etc.). A second configuration parameter is needed to specify the concrete class, but I don't see why the FlatFileDeserializerRecordReader can't just get these two classes from the Configuration itself.

Can the classes go in the org.apache.hadoop.contrib.serialization.mapred package to echo the main mapred package? When HADOOP-1230 is done, an equivalent could then go in the mapreduce package.

I agree it would be good to have tests for Writable, Java Serialization and Thrift to test the abstraction.

Shouldn't keys be file offsets, similar to TextInputFormat? The row numbers you have are actually the row number within the split, which might be confusing (and they're not unique per file).

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception (which is hard to distinguish from an exception due to corruption?)).
>   this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.
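
The two points raised above (keys as file offsets, and telling a clean EOF apart from a truncated record) can be illustrated with a small standalone sketch. It is not the code in the attached patches and uses no Hadoop classes: `FlatRecordSketch`, `readAll` and `encode` are hypothetical names, and the length-prefixed record layout (a 4-byte length followed by the payload) is an assumption made purely for illustration. The idea is to peek one byte at each record boundary: if `read()` returns -1 there, the stream ended cleanly; an `EOFException` raised later, in the middle of a record, then points to truncation or corruption.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: reads length-prefixed records
// and keys each one by the byte offset at which the record starts.
class FlatRecordSketch {

    // Returns offset -> payload in stream order. An EOFException escapes only
    // when the stream ends in the middle of a record (likely truncation).
    static Map<Long, byte[]> readAll(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 1);
        DataInputStream data = new DataInputStream(in);
        Map<Long, byte[]> records = new LinkedHashMap<>();
        long offset = 0;
        while (true) {
            // Peek one byte at the record boundary: -1 here is a clean EOF.
            int first = in.read();
            if (first == -1) {
                return records;
            }
            in.unread(first);

            // Assumed layout: 4-byte big-endian length, then the payload.
            int length = data.readInt();
            if (length < 0) {
                throw new IOException("negative record length at offset " + offset);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException if the record is cut short
            records.put(offset, payload);
            offset += 4 + length;
        }
    }

    // Builds an in-memory "flat file" in the assumed layout, for demonstration.
    static byte[] encode(String... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String value : values) {
            byte[] payload = value.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
        }
        out.flush();
        return bytes.toByteArray();
    }
}
```

The peek relies only on `read()` returning -1 at end of stream, so in principle it does not need access to a decompressor's internals. For a compressed input, the offsets in this sketch would be positions in the decompressed stream, which may or may not be what a real record reader wants as keys.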