hadoop-common-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Wed, 17 Nov 2010 05:14:22 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932817#action_12932817 ]

Owen O'Malley commented on HADOOP-6685:

Is there a strong reason to use ProtocolBuffers here rather than Avro, which is already a
dependency and provides similar functionality?

It isn't clear what context you mean here:
- Giving the user the ability to use ProtocolBuffers
- Using protocol buffers for the metadata

For the first point, ProtocolBuffers is an extremely well engineered and documented project.
The fit and finish are excellent. It has been finely honed by years of extensive use in production
systems. Providing the capability to natively run ProtocolBuffers through the pipeline without
third-party add-ons is a big win. See Kevin's presentation about using Protocol Buffers at
Twitter: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter.

For the second point, Avro is completely unsuitable for that context. For the serializer's
metadata, I need to encode a singleton object. With Avro, I would need to encode the schema
and then the metadata itself. To add insult to injury, the schema will be substantially
larger than the data. With ProtocolBuffers, I just encode the data, because the metadata is
part of the record definition. In contexts where many objects of the same type are being
serialized, Avro is more efficient; this context is very different. As a final point, as I've
told you previously, Avro's setup is very expensive: writing a 2-row sequence file is 10x
slower using Avro than ProtocolBuffers.
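To make the singleton argument concrete, here is a minimal sketch (not Hadoop code; the schema and class name are illustrative assumptions) comparing the size of a plausible one-field Avro record schema against the Avro binary encoding of the single datum it would describe. Avro encodes a string as a varint length prefix followed by the UTF-8 bytes, while the schema JSON must be shipped alongside it:

```java
// Illustrative only: shows why, for a one-off metadata record, the Avro
// schema that must accompany the data can dwarf the encoded datum itself.
public class SchemaOverheadDemo {
    // Assumption: the serializer metadata is essentially one class name.
    static final String SCHEMA =
        "{\"type\":\"record\",\"name\":\"SerializerMetadata\","
        + "\"fields\":[{\"name\":\"className\",\"type\":\"string\"}]}";
    static final String DATUM = "org.example.MySerialization";

    // Size of the schema JSON that Avro requires up front.
    static int schemaBytes() {
        return SCHEMA.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
    }

    // Avro binary encoding of a short string: 1-byte varint length + UTF-8 bytes.
    static int datumBytes() {
        return 1 + DATUM.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("schema: " + schemaBytes() + " bytes, datum: "
            + datumBytes() + " bytes");
    }
}
```

With ProtocolBuffers the field tags serve as the metadata, so only the datum-sized payload is written.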

I understand that you'd like Avro to be the one and only serialization format that Hadoop
supports, especially since that would help you push the development of Avro forward. But
forcing Avro on the users is unhealthy for Hadoop.

> Change the generic serialization framework API to use serialization-specific bytes instead
of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, SerializationAtSummit.pdf
> Currently, the generic serialization framework uses Map<String,String> for the
serialization specific configuration. Since this data is really internal to the specific serialization,
I think we should change it to be an opaque binary blob. This will simplify the interface
for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
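The shape of the proposed change can be sketched as follows. This is a hedged illustration, not the committed Hadoop interface: the method and class names here are assumptions. The point is that the serialization's configuration becomes an opaque byte[] that only the serialization itself knows how to interpret, rather than a Map<String,String> with framework-visible keys:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the proposed direction (illustrative names, not the real API):
// each serialization round-trips its own configuration as opaque bytes.
interface SelfSerializing {
    byte[] serializeSelf();           // capture this serialization's config
    void deserializeSelf(byte[] md);  // restore it from opaque bytes
}

// Toy serialization whose only configuration is a target class name.
class JavaClassSerialization implements SelfSerializing {
    private String className;

    JavaClassSerialization(String className) {
        this.className = className;
    }

    public byte[] serializeSelf() {
        // The framework never inspects these bytes; any encoding works.
        return className.getBytes(StandardCharsets.UTF_8);
    }

    public void deserializeSelf(byte[] md) {
        className = new String(md, StandardCharsets.UTF_8);
    }

    String getClassName() {
        return className;
    }
}
```

The framework simply stores and forwards the blob, which is what lets each serialization (Writable, Avro, ProtocolBuffers, Thrift) pick its own metadata encoding.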

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
