hadoop-common-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4065) support for reading binary data from flat files
Date Thu, 18 Sep 2008 23:42:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632451#action_12632451 ]

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

A couple of comments on the code:

SerializationContext<R> sinfo =
    (SerializationContext<R>) ReflectionUtils.newInstance(sinfoClass, conf);
sinfo.setConf(conf);

The setConf call is redundant: SerializationContext is Configurable, so ReflectionUtils.newInstance already configures the new instance.
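
(If I read ReflectionUtils right, newInstance(clazz, conf) calls setConf on anything Configurable, so the snippet could shrink to something like the following sketch - not tested against the patch:)

// ReflectionUtils.newInstance(clazz, conf) configures the new instance
// itself when the class implements Configurable, so the explicit
// sinfo.setConf(conf) call can be dropped.
SerializationContext<R> sinfo =
    (SerializationContext<R>) ReflectionUtils.newInstance(sinfoClass, conf);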

key.set(rnum++);
if (key == null)
    key = createKey();

Switch the order? As written, key.set() runs before the null check, so a null key would throw before createKey() is ever reached (or maybe the createKey()/createValue() calls are not required at all?).
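
Roughly this, as a sketch (assuming key really can be null on the first call and that rnum is the running row counter):

// Create the key on first use, then assign the row number; with the
// original ordering, set() would throw before the null check ran.
if (key == null)
    key = createKey();
key.set(rnum++);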

otherwise looks good.

wrt some of Tom's comments:

> The row numbers you have are actually the row number within the split, which might be confusing
The InputFormat is not splittable, so we are safe here.

> Is a SerializationContext class needed? 

Very much so. Let me walk through the Hive use case:
- Hive knows the deserialization class for each file. However, it knows this through metadata about the _file_ (the file belongs to a table that carries some metadata). This metadata is passed to mappers through the configuration.
- So the mapping is not from class -> deserializer but from file -> deserializer, and the ability to bootstrap the serialization factory from the configuration is critical: the configuration carries both the file name and the metadata for that file (see the sketch below).
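
To make that concrete, here is a rough sketch of what such a context could look like. The class name, the configuration property, and the lookup method are made up for illustration and are not from the patch:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only: a serialization context that maps file ->
// deserializer class by consulting the job configuration rather than the
// record class. Hive would write the per-file (table) metadata into the
// configuration on the client side; mappers read it back here.
public class FileDrivenSerializationContext implements Configurable {
  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }

  // Look up the deserializer class recorded for this particular file.
  // The property name is hypothetical.
  public Class<?> getDeserializerClassFor(Path file) throws ClassNotFoundException {
    String className = conf.get("example.file.deserializer." + file.getName());
    return conf.getClassByName(className);
  }
}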

This also seems to be the Hadoop style of doing things (all implementations can be Configurable), and I think that if it covers the Hive case it would help others as well. In fact, I think we should try to make this configurable-serialization-factory pattern a more fundamental part of the infrastructure; it seems more general than the class -> serialization way of bootstrapping (de)serialization.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> Like TextInputFormat, looking for a concrete implementation to read binary records from a flat file (which may be compressed).
> It's assumed that Hadoop can't split such a file, so the InputFormat can set splittable to false.
> Tricky aspects are:
> - how to know what class the file contains (it has to be in a configuration somewhere).
> - how to determine EOF (it would be nice if Hadoop could determine EOF and not have the deserializer throw an exception, which is hard to distinguish from an exception due to corruption). This is easy for non-compressed streams; for compressed streams, DecompressorStream has a useful-looking getAvailable() call, except the class is marked package private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

