hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4065) support for reading binary data from flat files
Date Mon, 08 Sep 2008 21:33:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629298#action_12629298 ]

Mahadev konar commented on HADOOP-4065:

We at Yahoo have been working on a similar kind of file, in which data is stored simply as
binary data and is splittable.


The spec is old and needs to be updated. TFile is meant to be a sequence file replacement.

  A TFile is a container of key-value pairs. Both keys and values are type-less
  byte arrays. Keys can be up to 64KB, value length is not restricted. TFile
  further provides the following features:
- Block Compression.
- Named meta data blocks.
- Sorted or unsorted keys.
- Seek by key or by file offset.
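
As a rough illustration of two of the features listed above (sorted keys and seek by key), here is a minimal in-memory sketch. It is not the TFile format or API: the class SortedByteStore, its methods, and the reading of "up to 64KB" as at most 64 * 1024 bytes are all assumptions made for this example, and block compression, meta blocks, and on-disk layout are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a sorted key-value container; not the TFile format or API. */
final class SortedByteStore {
    // Assumption: "up to 64KB" is read here as at most 64 * 1024 bytes.
    public static final int MAX_KEY_BYTES = 64 * 1024;

    private final List<byte[]> keys = new ArrayList<>();
    private final List<byte[]> values = new ArrayList<>();

    /** Compares two keys lexicographically, treating each byte as unsigned. */
    public static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Appends a pair; keys must arrive in non-decreasing order. Values have no length limit. */
    public void append(byte[] key, byte[] value) {
        if (key.length > MAX_KEY_BYTES) {
            throw new IllegalArgumentException("key longer than " + MAX_KEY_BYTES + " bytes");
        }
        if (!keys.isEmpty() && compareKeys(keys.get(keys.size() - 1), key) > 0) {
            throw new IllegalArgumentException("keys must be appended in sorted order");
        }
        keys.add(key);
        values.add(value);
    }

    /** Binary search: index of the first entry whose key is >= target, or size() if there is none. */
    public int seek(byte[] target) {
        int lo = 0;
        int hi = keys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareKeys(keys.get(mid), target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public byte[] valueAt(int index) {
        return values.get(index);
    }

    public int size() {
        return keys.size();
    }
}
```

Because keys are type-less byte arrays, the sketch orders them by unsigned byte value, so a key starting with 0x80 sorts after one starting with 0x01.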

We will update the specs on HADOOP-3315 by the end of this week. 

> support for reading binary data from flat files
> -----------------------------------------------
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception, which is hard to distinguish from an exception due to corruption). this is easy for non-compressed streams. for compressed streams, DecompressorStream has a useful looking getAvailable() call, except the class is marked package private.
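
One way to approach the EOF question in the issue description, sketched with only java.io and java.util.zip rather than any Hadoop class: probe a single byte at each record boundary. A return of -1 there is a clean end of file, while an EOFException in the middle of a record points to truncation or corruption. The record layout (4-byte big-endian length followed by the payload), the class name FlatRecordReader, and the MAX_RECORD_BYTES limit are assumptions made for this example, and GZIP stands in for whichever codec is actually in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical reader for length-prefixed binary records; the layout is an assumption. */
final class FlatRecordReader {
    // Assumed sanity limit so a corrupt length field cannot trigger a huge allocation.
    public static final int MAX_RECORD_BYTES = 16 * 1024 * 1024;

    /**
     * Reads records until the stream ends on a record boundary.
     * Throws java.io.EOFException if the stream ends in the middle of a record.
     */
    public static List<byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // Probe one byte: -1 here means a clean EOF between records.
            int first = in.read();
            if (first < 0) {
                return records;
            }
            // The remaining length bytes must exist; readUnsignedByte throws EOFException if not.
            int len = (first << 24)
                    | (in.readUnsignedByte() << 16)
                    | (in.readUnsignedByte() << 8)
                    | in.readUnsignedByte();
            if (len < 0 || len > MAX_RECORD_BYTES) {
                throw new IOException("implausible record length: " + len);
            }
            byte[] payload = new byte[len];
            in.readFully(payload); // throws EOFException on a truncated payload
            records.add(payload);
        }
    }

    /** Test helper: writes the payloads as length-prefixed records into a GZIP stream. */
    public static byte[] gzipRecords(byte[]... payloads) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buffer))) {
            for (byte[] payload : payloads) {
                out.writeInt(payload.length);
                out.write(payload);
            }
        }
        return buffer.toByteArray();
    }
}
```

The same readAll call works on a plain stream or on one wrapped in java.util.zip.GZIPInputStream, since both return -1 from read() at the end of the data. This avoids needing a call such as getAvailable() on the decompressing stream, at the cost of assuming that every record starts with a length prefix.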

