hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Thu, 25 Nov 2010 02:42:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935617#action_12935617 ]

Chris Douglas commented on HADOOP-6685:
---------------------------------------

bq. This line of reasoning is overly general and could be used to support the addition of
literally any dependency (i.e. dependency x already exists, so it's OK to add y).

That wasn't how I read the argument. There exist some set of serialization frameworks and
Hadoop depends on some of them. Arguing that the existing set is final is a decision requiring
consensus. If adding serializations as default dependencies was permitted without this level
of debate in the past, it seems fair to ask why this particular dependency is worthy of a
veto. I don't know if everyone realizes that, by changing the list of serializations, this
patch makes the thrift dependency optional: still required to build Hadoop, but optional in
the serialization context. The protocol buffer dependency could be linked against the "lite"
runtime if that would address some of the concerns.
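
To make the "optional for the serialization context" point concrete, here is a minimal sketch
of reflective loading from configuration. This is an illustration under assumptions, not
Hadoop's actual SerializationFactory; the class and property names are invented:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (hypothetical, not Hadoop's SerializationFactory): each
// serialization is loaded reflectively from a configuration string, so a
// framework's jar is needed on the classpath only when its serialization
// is actually listed and used.
public class SerializationLoader {
  public static List<Object> load(String configured) {
    List<Object> loaded = new ArrayList<Object>();
    for (String className : configured.split(",")) {
      try {
        loaded.add(Class.forName(className.trim()).newInstance());
      } catch (Exception e) {
        // Class absent from the classpath: that framework is simply
        // unavailable in this context; skip rather than fail.
      }
    }
    return loaded;
  }

  public static void main(String[] args) {
    // Only the entries named here need their jars at runtime; the second
    // entry is missing and is skipped, so this prints 1.
    System.out.println(load("java.util.HashMap,com.example.Missing").size());
  }
}
{code}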

bq. Hadoop will benefit greatly in the long term by promoting a single, default serialization
and file format for new users. I was under the impression that this was a shared goal and
that the chosen format was Avro.

The project never voted to adopt Avro as the one and only serialization, but nobody has challenged
the assertion that it should be supported as a first-class form. AFAIK, every contributor
advocates it. As Tom noted, the type is sufficient for protocol buffers, thrift, and writables,
but Avro requires the schema. This patch is general enough to support Avro (or anything else).
The interest in the bytes used to encode the type (granted, relying on protocol buffers, which
is a new dependency) is out of proportion to its importance. IMO, just tossing this data
into JSON only solves half the problem; protocol buffers allow for extensions that could
open fruitful paths for experimentation.
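
For those reading along without the patch open, a hypothetical sketch of the API shape under
discussion; the names are invented here and are not the classes in the patch:

{code:java}
// Hypothetical sketch: metadata becomes opaque bytes owned by the
// serialization, rather than a shared Map<String,String>.
public abstract class OpaqueSerialization<T> {
  // For protocol buffers, thrift, and Writables the type name alone is
  // enough to rebuild a serializer; Avro must also carry the writer's
  // schema inside these bytes. How the bytes themselves are encoded
  // (JSON vs. protocol buffers) is the open question.
  public abstract byte[] getMetadata();
  public abstract void setMetadata(byte[] metadata) throws java.io.IOException;

  public abstract void serialize(java.io.OutputStream out, T obj)
      throws java.io.IOException;
  public abstract T deserialize(java.io.InputStream in)
      throws java.io.IOException;
}
{code}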

It's news to me that Doug has decided not to integrate Avro into Hadoop. Some API changes
were vetoed months ago, but many (including "opponents" in this limited context) still discuss
it as a natural part of core data formats, as an RPC candidate, etc. Frankly, the PMC hasn't
discussed anything like a policy. It's disappointing to learn that he's withdrawn from this
community and elected to work on forwarding his vision in Avro, but the consequences of that
should be limited to his engagement. Part of what has fueled the worst of this thread has
been the complementary freezing of the formats and serializations hosted in Hadoop; such a
division between framework/"user" code makes sense given how Doug wants to engage, but that
model has not been agreed on by the PMC. Given a choice and not a mandate, I'm not convinced
he would find this community antagonistic to his goals, but he can spend his time and talent
as he chooses. If the Avro data file were included alongside TFile and SequenceFile, personally
I see no reason why we could not write the documentation that would help downstream developers
(including frameworks) realize the benefits he foresees.

SequenceFile is an experimental format. As long as changes are compatible with existing data,
changes/enhancements to it have been tolerated to date. The concern that it distracts from
the format that ships with Avro is overblown; most users interact with data through frameworks,
and those frameworks are managing most of the compatibility headaches. The patch reaches no
further than previous modifications.

----

The existential question of the scope of Hadoop needs to be answered by the PMC, not navigated
by vetoes. The packaging question is part of a larger dependency problem the project needs
to answer more directly; the solution Scott advocates could make progress toward that, but
it's a separate issue. The dependency on protocol buffers for encoding the type is the only
substantial objection, as it commits that dependency to persistent data.

After taking a deep breath, this issue feels pretty big/little endian. How jobs are configured
in Hadoop is relevant, and this step moves away from the XML Configuration Hadoop has used
for years, but no direction proposed will cause the project to fail as spectacularly as the
mutual veto. Tom, Doug: if you're willing to drop the conditions on your vetoes for everything
but the protocol buffers vs. JSON question, could a technical discussion of the merits of those
formats for metadata get us around this impasse?
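
To keep that discussion concrete: below is a small illustration of what committing protocol
buffers' wire format to persistent metadata entails for future readers. The message shape is
hypothetical, not anything in the patch:

{code:java}
import java.io.UnsupportedEncodingException;

// Illustration only: the protobuf wire format is simple, but once it is
// written into file headers every future reader must understand it.
// Hand-decoding one length-delimited field (field #1, e.g. a type name)
// shows the extent of that commitment.
public class WireFormatDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // Equivalent of: message M { string type_name = 1; } with "Foo".
    // Tag byte = (field_number << 3) | wire_type = (1 << 3) | 2 = 0x0A.
    byte[] persisted = {0x0A, 0x03, 'F', 'o', 'o'};
    int fieldNumber = persisted[0] >> 3;   // 1
    int length = persisted[1];             // 3-byte payload follows
    String typeName = new String(persisted, 2, length, "UTF-8");
    System.out.println(fieldNumber + " -> " + typeName); // prints "1 -> Foo"
  }
}
{code}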

> Change the generic serialization framework API to use serialization-specific bytes instead
> of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
>                      SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the
> serialization-specific configuration. Since this data is really internal to the specific
> serialization, I think we should change it to be an opaque binary blob. This will simplify
> the interface for defining specific serializations for different contexts (MAPREDUCE-1462).
> It will also move us toward having serialized objects for Mappers, Reducers, etc.
> (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

