hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: [VOTE] Direction for Hadoop development
Date Wed, 01 Dec 2010 12:25:56 GMT
On 29/11/10 22:30, Owen O'Malley wrote:
> All,
> Based on the discussion on HADOOP-6685, there is a pretty fundamental
> difference of opinion about how Hadoop should evolve. We need to figure
> out how the majority of the PMC wants the project to evolve to
> understand which patches move us forward. Please vote whether you
> approve of the following direction. Clearly as the author, I'm +1.
>
> -- Owen
>
> Hadoop has always included library code so that users had a strong
> foundation to build their applications on without needing to continually
> reinvent the wheel. This combination of framework and powerful library
> code is a common pattern for successful projects, such as Java, Lucene,
> etc. Toward that end, we need to continue to extend the Hadoop library
> code and actively maintain it as the framework evolves. Continuing
> support for SequenceFile and TFile, which are both widely used, is
> mandatory. The opposite pattern of implementing the framework and
> letting each distribution add the required libraries will lead to
> increased community fragmentation and vendor lock-in.
>
> Hadoop's generic serialization framework had a lot of promise when it
> was introduced, but has been hampered by a lack of plugins other than
> Writables and Java serialization. Supporting a wide range of
> serializations natively in Hadoop will give the users new capabilities.
> Currently, to support Avro or ProtoBuf objects, mutually incompatible
> third-party solutions are required. It benefits Hadoop to support all
> of them with a common framework. In particular,
> having easy, out of the box support for Thrift, ProtoBufs, Avro, and our
> legacy serializations is a desired state.
>
> As a distributed system, there are many instances where Hadoop needs to
> serialize data. Many of those applications need a lightweight, versioned
> serialization framework like ProtocolBuffers or Thrift, and using them is
> appropriate. Adding dependencies on Thrift and ProtocolBuffers to the
> previous dependence on Avro is acceptable.
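
For reference, the pluggable serialization framework Owen describes hangs
off the io.serializations configuration key; a rough sketch of registering
and using one, assuming the 0.21-era org.apache.hadoop.io.serializer API
and the Avro serializer classes that ship alongside it:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.serializer.SerializationFactory;
  import org.apache.hadoop.io.serializer.Serializer;
  import java.io.ByteArrayOutputStream;
  import java.io.IOException;

  // Sketch only: extra serializations are listed in io.serializations;
  // SerializationFactory then hands back a Serializer for a given class.
  // (The Avro class names are assumed from the 0.21 serializer package.)
  public class SerializationSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      conf.setStrings("io.serializations",
          "org.apache.hadoop.io.serializer.WritableSerialization",
          "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization",
          "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization");

      SerializationFactory factory = new SerializationFactory(conf);
      Serializer<Text> serializer = factory.getSerializer(Text.class);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      serializer.open(out);
      serializer.serialize(new Text("payload"));
      serializer.close();
    }
  }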


I'm happy with new build-time dependencies on Thrift and ProtocolBuffers,
with one big warning: until Thrift graduates from the Incubator and puts
out an official release, Apache management will veto any redistribution
of the Thrift JARs; they aren't signed off for public use.

I'm not so sure about more runtime dependencies that go all the way into
the classpath of the things working with HDFS, or with files created in
it, because that leads to version problems in private code. [Inevitably
Hadoop will end up adopting some OSGi-like classpath setup, but I'm not
pushing for that as it has its own interesting issues.]

At the same time, you can't add features without adding dependencies,
except by playing rebasing tricks, and I have mixed feelings about those
tricks:
   good: lets the Hadoop team push things out on their own schedule
   bad: impossible to push out security fixes to dependent libraries
without rebuilding and re-releasing things. Your ops team will hate you.

For that bad reason, and because it's extra work, I avoid playing
rebasing games and just try to do classpaths right in the first place,
which is easier said than done.

One thing raised in the HADOOP-6685 discussion was JSON as a format for
things. Adopting JSON -and deciding which JSON parser to use- is trouble.


Ignoring the ongoing discussion of serialization formats, the question 
"should we use JSON?" really leads back to "which external JSON parser 
do we want to use?", which is a separate -and significant- problem.

I say this as someone who has three separate JSON parsers on the runtime
classpath of something whose functional tests are failing in a Hudson
window blinking at me alongside this email application.

gson: http://code.google.com/p/google-gson/
  http://mvnrepository.com/artifact/com.google.code.gson/gson/1.4
  com.google.code.gson/gson-1.5.1; no runtime dependencies
-some people like the seamless binding to Java objects, which I view as
repeating the same mistakes as WS-*.

json-lib: http://json-lib.sourceforge.net/
http://mvnrepository.com/artifact/net.sf.json-lib/json-lib/2.3
at runtime tends to need the usual commons-logging back end and
  net.sf.json-lib/json-lib-2.3
  net.sf.ezmorph/ezmorph-1.06
  commons-lang-2.4
  commons-collections-3.2.1
-low level, DOM-ish, could be improved to be more Java-5-intuitive

Jackson: http://jackson.codehaus.org
org.codehaus.jackson/jackson-core-asl-1.6.2
org.codehaus.jackson/jackson-asl/0.9.5
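
To make the difference concrete: pulling one field out of the same
document with each of the three looks roughly like this (a sketch only,
assuming the gson 1.x, json-lib 2.3 and Jackson 1.x artifacts listed
above):

  import org.codehaus.jackson.map.ObjectMapper;
  import com.google.gson.JsonParser;
  import net.sf.json.JSONObject;

  // Sketch: the same trivial lookup with each parser; versions as above.
  public class JsonParserComparison {
    public static void main(String[] args) throws Exception {
      String doc = "{\"name\":\"hdfs\",\"replication\":3}";

      // Jackson 1.x: tree model via ObjectMapper
      String viaJackson =
          new ObjectMapper().readTree(doc).get("name").getTextValue();

      // gson 1.x: JsonParser gives back a JsonElement tree
      String viaGson =
          new JsonParser().parse(doc).getAsJsonObject()
              .get("name").getAsString();

      // json-lib 2.x: static factory, plus its transitive dependencies
      String viaJsonLib = JSONObject.fromObject(doc).getString("name");

      System.out.println(viaJackson + " " + viaGson + " " + viaJsonLib);
    }
  }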

Now, before someone points out that three JSON parsers is too many: this
same code has Log4J, SLF4J (with a back end to JCL), a patched back-end
logger for Jetty to avoid SLF4J where possible, and a custom JCL
back-end. On the XML side there's Xerces and Xalan instead of the JVM
versions, and Hibernate pulling in dom4j alongside. Test runs add
HtmlUnit to the classpath, which pulls in the older httpclient libs
alongside the http-core stuff I've switched to.

Java library versions -while more manageable than native library
versions- are a pain. Regardless of the ugliness of XML or the
mediocrity of DOM, running over to JSON just because DOM is unwieldy is
swapping one source of trouble for another.

If Hadoop is going to use JSON in places, then the discussion/decision 
about which JSON parser to stick on the classpath is worthy of a JIRA 
issue all of its own.

-steve

(returning to his failing tests)



