hadoop-common-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Mon, 22 Nov 2010 22:50:19 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934644#action_12934644 ]

Owen O'Malley commented on HADOOP-6685:

> The first is that no change is needed in SequenceFile unless we want to support Avro, but,
> given that Avro data files were designed for this, and are multi-lingual, why change the SequenceFile
> format solely to support Avro? Are Avro data files insufficient? Note that Thrift and Protocol
> Buffers can be stored in today's SequenceFiles.

This isn't true. SequenceFile needs to be changed to support the new serialization API. The
class name alone isn't sufficient to determine the serialization. Furthermore, you can't
implement context-sensitive serializations (MAPREDUCE-1462) without the changes to SequenceFile.
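The point that a class name alone can't identify a serialization can be sketched in plain Java. This is an illustrative model, not the actual Hadoop API: the names SerializationMetadata and specificBytes are hypothetical. Two serializations can handle the same class and differ only in their serialization-specific bytes, so a file format that records only the class name loses information the reader needs.

```java
import java.util.Arrays;

// Hypothetical sketch: a serialization is identified by the class it
// handles PLUS an opaque blob that only that serialization can parse.
public class MetadataSketch {

    static final class SerializationMetadata {
        final String className;
        final byte[] specificBytes; // e.g. an Avro schema or a Thrift descriptor

        SerializationMetadata(String className, byte[] specificBytes) {
            this.className = className;
            this.specificBytes = specificBytes;
        }
    }

    public static void main(String[] args) {
        // Two records with the same class name but different
        // serialization-specific metadata: unless the file stores the
        // bytes, the reader cannot tell these apart.
        SerializationMetadata avro = new SerializationMetadata(
            "org.example.Record",
            "{\"type\":\"record\"}".getBytes());
        SerializationMetadata thrift = new SerializationMetadata(
            "org.example.Record",
            new byte[] {0x0B, 0x00, 0x01});

        if (!avro.className.equals(thrift.className)) {
            throw new AssertionError("class names should match");
        }
        if (Arrays.equals(avro.specificBytes, thrift.specificBytes)) {
            throw new AssertionError("metadata should differ");
        }
        System.out.println("same class, different serialization metadata");
    }
}
```

This is also why a context-sensitive serialization needs the container format to carry the metadata through: the same class can be serialized differently in different contexts.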

> Are Avro data files insufficient?

Yes. They don't support indices. They don't support key/value pairs. They don't support other
types, such as Writables. Furthermore, our users already use SequenceFiles heavily and don't want
to port to a new file format. Extending SequenceFile gives them more flexibility.

> I wonder if JSON might be a good nestable format for serialization metadata? JSON supports
> nesting, and distinguishes numeric, boolean and string types. With Jackson, one can serialize
> and deserialize Java objects as JSON, to get compile-time type checking.

In MAPREDUCE-980, you took out the custom JSON parser and replaced it with calls into Avro.
Using Protocol Buffers is efficient and meant that I only wrote 2 lines of code. If I used JSON,
I would need to write a parser and a printer.

> Change the generic serialization framework API to use serialization-specific bytes instead
of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
> Currently, the generic serialization framework uses Map<String,String> for the
serialization specific configuration. Since this data is really internal to the specific serialization,
I think we should change it to be an opaque binary blob. This will simplify the interface
for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
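The change the description proposes can be sketched as follows. This is a hypothetical model, not the actual HADOOP-6685 patch: the names SerializationBase, serializeSelf, and deserializeSelf are illustrative. The idea is that the framework stores and replays an opaque byte[] without ever inspecting it, so each serialization is free to choose its own internal format.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: instead of a shared Map<String,String>, each
// serialization encodes and decodes its own configuration as opaque bytes.
public class OpaqueConfigSketch {

    interface SerializationBase {
        byte[] serializeSelf();          // emit serialization-specific bytes
        void deserializeSelf(byte[] b);  // restore state from those bytes
    }

    // A toy serialization whose only state is a schema string; it picks
    // UTF-8 text as its private wire format. Another serialization could
    // pick Protocol Buffers or Avro without any framework change.
    static final class SchemaSerialization implements SerializationBase {
        private String schema = "";

        public byte[] serializeSelf() {
            return schema.getBytes(StandardCharsets.UTF_8);
        }

        public void deserializeSelf(byte[] b) {
            schema = new String(b, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        SchemaSerialization writer = new SchemaSerialization();
        writer.deserializeSelf("int,string".getBytes(StandardCharsets.UTF_8));

        // The framework would store writer.serializeSelf() in the file
        // header and replay it on the read side, treating it as a blob.
        SchemaSerialization reader = new SchemaSerialization();
        reader.deserializeSelf(writer.serializeSelf());

        System.out.println("round-tripped opaque config OK");
    }
}
```

Because the framework never parses the blob, a serialization's configuration can be as simple or as structured as it likes, which is what makes context-specific serializations (MAPREDUCE-1462) straightforward to layer on top.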

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
