hadoop-common-issues mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Wed, 15 Dec 2010 20:14:11 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971812#action_12971812 ]

Scott Carey commented on HADOOP-6685:

I apologize for replying to a point made in this conversation 30 days ago, but this may
be useful.

{quote}For the second point, Avro is completely unsuitable for that context. For the serializer's
metadata, I need to encode a singleton object. With Avro, I would need to encode the schema
and then the metadata information. To add insult to injury, the schema will be substantially
larger than the data. With ProtocolBuffers, I just encode the data.{quote}

This is not true: all configurations could share the same Avro schema.  An Avro schema that
defines all possibilities is equivalent to tagging fields with type tags.
Essentially, the schema would be a record containing an array of fields, with each field a
union of all possible field types.  The current Avro API for this use case is clunky, and
perhaps Avro could make it easier, but you can do dynamic typing and tagged fields in Avro.

Because that schema is fixed, you don't have to serialize it with the data, which covers the
use case here where you just want to encode data and not a schema, akin to some PB/Thrift use
cases.  The cost is the overhead of the type tags, and the objects generated via either the
Java Reflect or Generic APIs would be cumbersome to use.  I would be willing to work on an
API for Avro that makes it easier to read and write a tagged-tuple dynamic data type.
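
To make the "tagged fields" idea concrete, here is a minimal sketch in plain Java. It does
not use the Avro API: the class TaggedTuple, its tag numbering, and its byte layout are all
hypothetical, chosen only for illustration. It shows the general shape of the approach: every
value is preceded by a small type tag, much as Avro writes a union's branch index before the
value, so a reader that knows the one fixed "array of unions" layout can decode any
configuration without a per-configuration schema in the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Avro or Hadoop. */
public class TaggedTuple {
    // Type tags play the role of the union branch index (numbering is an assumption).
    private static final int NULL = 0, BOOLEAN = 1, LONG = 2, DOUBLE = 3, STRING = 4;

    /** Writes a field count, then each field as a one-byte tag followed by its value. */
    public static byte[] encode(List<Object> fields) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(fields.size());
        for (Object f : fields) {
            if (f == null) {
                out.writeByte(NULL);
            } else if (f instanceof Boolean) {
                out.writeByte(BOOLEAN);
                out.writeBoolean((Boolean) f);
            } else if (f instanceof Long) {
                out.writeByte(LONG);
                out.writeLong((Long) f);
            } else if (f instanceof Double) {
                out.writeByte(DOUBLE);
                out.writeDouble((Double) f);
            } else if (f instanceof String) {
                out.writeByte(STRING);
                out.writeUTF((String) f);
            } else {
                throw new IllegalArgumentException("unsupported type: " + f.getClass());
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the fields back, using each tag to decide how to decode the next value. */
    public static List<Object> decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readInt();
        List<Object> fields = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int tag = in.readByte();
            switch (tag) {
                case NULL:    fields.add(null);             break;
                case BOOLEAN: fields.add(in.readBoolean()); break;
                case LONG:    fields.add(in.readLong());    break;
                case DOUBLE:  fields.add(in.readDouble());  break;
                case STRING:  fields.add(in.readUTF());     break;
                default: throw new IOException("unknown type tag: " + tag);
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // A made-up serializer configuration: class name, version, flag, unset value, ratio.
        List<Object> config = Arrays.asList("org.example.MyKey", 42L, true, null, 0.5);
        byte[] blob = encode(config);
        System.out.println(decode(blob).equals(config)); // prints true
    }
}
```

The trade-off is visible here: each value costs one extra tag byte, and the decoded result
is an untyped list that callers must cast, which is the kind of cumbersome usage a dedicated
API would aim to hide.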

> Change the generic serialization framework API to use serialization-specific bytes instead
of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>         Attachments: serial.patch, serial4.patch, serial6.patch, serial7.patch, serial9.patch,
> Currently, the generic serialization framework uses Map<String,String> for the
serialization specific configuration. Since this data is really internal to the specific serialization,
I think we should change it to be an opaque binary blob. This will simplify the interface
for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
