hadoop-common-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Fri, 12 Nov 2010 23:29:17 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931566#action_12931566 ]

Doug Cutting commented on HADOOP-6685:
--------------------------------------

Owen, thanks for the slides.  I don't see a direct relation between this issue and the issue
of simplifying the implementation of efficient map-side joins (MAPREDUCE-1183, more or less).
 Am I missing the connection, or is this a distinct issue?

File formats are forever.  More variations add significant, long-term compatibility burdens
to the project.  We badly need to add support for a higher-level object serialization system
than Writable.  But I'm not convinced it's wise to add such support to the existing Java-only
container file formats.  So I'm all for a more generic serialization API that can be used
by MapReduce applications.  I don't, however, see that it follows that we should provide implementations
of file formats with a large number of different serialization systems, as that invites multiplicative
long-term support issues.  I'd prefer that we instead direct users towards a single preferred
high-level serialization system and a single preferred container.  Historically that's been
Writable and SequenceFile.  We now need to migrate from these to a more expressive, language-independent
serialization system and container file.  Our APIs should of course be general enough that
it's possible to incorporate different serialization systems and different file formats, but
we needn't provide implementations of all combinations of these; rather, we should direct
folks towards a primary implementation.

Google benefits tremendously by having a single standard serialization system and container
file format.  The Dremel paper (http://sergey.melnix.com/pub/melnik_VLDB10.pdf) argues that
this is an essential enabler of their wide variety of interoperable systems.  The further
we depart from this the harder we make it to build systems like Dremel that multiply the utility
of stored data.

Changing serialization systems or file formats is a major imposition for many applications.
 They cannot afford to do it frequently.  We should provide a clear path forward from Writable+SequenceFile
to a new system that's easier to use, less fragile, and language-independent to better facilitate
a rich ecosystem of tools.

> Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: serial.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the
> serialization specific configuration. Since this data is really internal to the specific serialization,
> I think we should change it to be an opaque binary blob. This will simplify the interface
> for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
> move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
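
For readers unfamiliar with the proposal quoted above, the difference between the two configuration styles can be sketched roughly as follows. This is only an illustration, not the contents of serial.patch: the interface names, the method names (getMetadata, serializeSelf, deserializeSelf) and the SchemaSerialization class are all hypothetical, and the "schema" is just a string chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SerializationConfigSketch {

    // Hypothetical "before" shape: the framework handles the
    // serialization's configuration as generic string pairs.
    interface MapConfiguredSerialization {
        Map<String, String> getMetadata();
    }

    // Hypothetical "after" shape: the framework only stores and forwards
    // an opaque blob; its layout is private to the serialization.
    interface BytesConfiguredSerialization {
        byte[] serializeSelf();
        void deserializeSelf(byte[] blob);
    }

    // Illustrative serialization whose only configuration is a schema string.
    static final class SchemaSerialization
            implements MapConfiguredSerialization, BytesConfiguredSerialization {
        private String schema;

        SchemaSerialization(String schema) {
            this.schema = schema;
        }

        String getSchema() {
            return schema;
        }

        @Override
        public Map<String, String> getMetadata() {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("schema", schema); // key name is an assumption
            return metadata;
        }

        @Override
        public byte[] serializeSelf() {
            // The encoding is the serialization's own business; UTF-8 here.
            return schema.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void deserializeSelf(byte[] blob) {
            schema = new String(blob, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        SchemaSerialization writer = new SchemaSerialization("{\"type\":\"string\"}");
        System.out.println("map form keys: " + writer.getMetadata().keySet());

        // The framework would carry this blob without interpreting it.
        byte[] blob = writer.serializeSelf();
        SchemaSerialization reader = new SchemaSerialization("");
        reader.deserializeSelf(blob);
        System.out.println("round trip ok: " + reader.getSchema().equals(writer.getSchema()));
    }
}
```

In the bytes form the framework never needs to know which keys exist or what they mean, which is the simplification the description argues for. The trade-off is that the configuration is no longer readable by generic tools.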

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

