hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Fri, 19 Nov 2010 22:06:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934008#action_12934008 ]

Chris Douglas commented on HADOOP-6685:
---------------------------------------

bq. Creating new concrete data formats that are functionally equivalent to other concrete formats decreases ecosystem interoperability, flexibility and maintainability. Above I cited the Dremel paper, whose section 2 outlines a scenario that they argue is only possible because all of the systems involved share a single common serialization and file format.

Your argument is: Avro and its file format must be maximally attractive to realize the benefits of a common serialization and file format across _all installations_ of Hadoop. Someone at Google wrote a paper about how well that worked inside their company. SequenceFile cannot be permitted to support multiple serializations, because that would make adoption of those components (or "alternatives" with exactly the same design points) less attractive.

Users are going to put their data in whatever form is convenient, often for legacy/interoperability reasons. Companies like Yahoo are interested in ubiquitous data formats because they have a lot of data, many groups building on top of it, and teams of people employed to curate it. Others may be less enthusiastic about your "one serialization/data format to rule them all." Using your veto in Hadoop to prohibit SequenceFile or TFile from evolving into a viable competitor/alternative to Avro's preferred format is invalid. If SequenceFile didn't support Avro, would you drop your veto on this point?

bq. it didn't create a core dependency on ProtocolBuffers

Anywhere? Or as it's used in the current patch?

> Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the serialization-specific configuration. Since this data is really internal to the specific serialization, I think we should change it to be an opaque binary blob. This will simplify the interface for defining specific serializations for different contexts (MAPREDUCE-1462). It will also move us toward having serialized objects for Mappers, Reducers, etc. (MAPREDUCE-1183).
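
A minimal sketch of the shape of that change, with hypothetical interface and method names that are not taken from the attached patches: configuration moves from string pairs the framework must model to bytes that only the serialization itself interprets.

{code:java}
// Hypothetical sketch only; illustrative names, not the API from serial*.patch.
import java.util.Map;

// Before: serialization configuration exposed as generic string pairs
// that the framework has to represent and pass around itself.
interface MapConfiguredSerialization<T> {
  void configure(Map<String, String> metadata);
}

// After: configuration carried as an opaque blob of bytes whose contents
// only the serialization understands (e.g. an Avro schema or a Writable
// class name); the framework merely stores and forwards the blob.
interface BytesConfiguredSerialization<T> {
  byte[] getMetadata();
  void setMetadata(byte[] metadata);
}
{code}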

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

