hadoop-common-dev mailing list archives

From "Pete Wyckoff (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4065) support for reading binary data from flat files
Date Tue, 16 Sep 2008 21:19:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Description: 
like textinputformat - looking for a concrete implementation to read binary records from a
flat file (that may be compressed).

it's assumed that hadoop can't split such a file, so the inputformat can set splittable to
false.

tricky aspects are:
- how to know what class the file contains (has to be in a configuration somewhere).
- how to determine EOF. it would be nice if hadoop could determine EOF itself rather than have
the deserializer throw an exception (which is hard to distinguish from an exception due to
corruption). this is easy for non-compressed streams. for compressed streams, DecompressorStream
has a useful looking getAvailable() call, except the class is marked package private.
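
One way to separate a clean EOF from a truncated or corrupt record, without needing access to the decompressor internals, is to peek one byte at each record boundary. The following is a minimal sketch using only the JDK, not Hadoop's API: the class name FlatFileEofSketch, the helper names, and the length-prefixed record layout are all assumptions made for illustration. The same loop works over a plain stream and over a GZIPInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: records are assumed to be [4-byte length][body].
public class FlatFileEofSketch {

    /** Reads records until the stream ends exactly on a record boundary. */
    static List<byte[]> readAll(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 1);
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // Peek one byte: -1 here means a clean EOF between records.
            int first = in.read();
            if (first == -1) {
                return records;
            }
            in.unread(first);
            // Past this point, running out of bytes means a truncated record,
            // and readInt/readFully report it as an EOFException.
            int length = data.readInt();
            if (length < 0) {
                throw new IOException("corrupt record length: " + length);
            }
            byte[] body = new byte[length];
            data.readFully(body);
            records.add(body);
        }
    }

    /** Test helper: writes length-prefixed records, optionally gzip-compressed. */
    static byte[] encode(boolean gzip, String... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
        try (DataOutputStream out = new DataOutputStream(sink)) {
            for (String value : values) {
                byte[] body = value.getBytes(StandardCharsets.UTF_8);
                out.writeInt(body.length);
                out.write(body);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = encode(false, "a", "bc");
        byte[] zipped = encode(true, "a", "bc");
        System.out.println(readAll(new ByteArrayInputStream(plain)).size());
        System.out.println(
            readAll(new GZIPInputStream(new ByteArrayInputStream(zipped))).size());
        try {
            // Drop the last byte so the final record is incomplete.
            readAll(new ByteArrayInputStream(Arrays.copyOf(plain, plain.length - 1)));
        } catch (EOFException e) {
            System.out.println("truncated record detected");
        }
    }
}
```

With this shape the deserializer is only invoked when at least one more byte exists, so an exception thrown from inside it can be treated as corruption rather than end of input. A real record format without a length prefix would need its own boundary rule.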



  was:
Implement generic FlatFileDeserializationRecordReader which assumes a Serialization Implementation
is specified in the JobConf and that, once instantiated, that Serialization Implementation can
figure out the actual class being deserialized from the JobConf. E.g., the JobConf specifies
RecordIOSerialization and then the specific class is LogRecordObject.
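
The configuration-driven lookup described above can be sketched roughly as follows. This is an illustration only, not the attached patch: java.util.Properties stands in for JobConf, and the key name flatfile.record.class, the class name, and the method name are all hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's JobConf.
public class ConfiguredRecordClassSketch {

    // Assumed key name; the real configuration key is not given in the issue.
    static final String RECORD_CLASS_KEY = "flatfile.record.class";

    /** Instantiates the record class named in the configuration. */
    static <T> T newRecord(Properties conf, Class<T> expectedType)
            throws ReflectiveOperationException {
        String name = conf.getProperty(RECORD_CLASS_KEY);
        if (name == null) {
            throw new IllegalStateException(RECORD_CLASS_KEY + " is not set");
        }
        // asSubclass fails fast with ClassCastException on a wrong type.
        Class<? extends T> cls = Class.forName(name).asSubclass(expectedType);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties conf = new Properties();
        conf.setProperty(RECORD_CLASS_KEY, "java.lang.StringBuilder");
        CharSequence record = newRecord(conf, CharSequence.class);
        System.out.println(record.getClass().getName());
    }
}
```

Resolving the class by name at read time is what lets the reader stay generic: nothing has to be registered ahead of time, at the cost of failing only when the job runs if the name is wrong.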

Another way one might do this is to use the SerializationFactory to look up the
Serialization Implementation; however, this requires all Serialization Implementations to
be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader.
(see below re: adding Serialization implementations to contrib).

To ensure it is generic, I propose implementing the following Serialization implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 




reverting to original description :)


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from
> a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file, so the inputformat can set splittable
> to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF. it would be nice if hadoop could determine EOF itself rather than have
> the deserializer throw an exception (which is hard to distinguish from an exception due to
> corruption). this is easy for non-compressed streams. for compressed streams, DecompressorStream
> has a useful looking getAvailable() call, except the class is marked package private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
