hadoop-common-dev mailing list archives

From Sanjay Radia <sra...@yahoo-inc.com>
Subject Hadoop 1.0 Compatibility Discussion.
Date Tue, 21 Oct 2008 01:50:00 GMT
The Hadoop 1.0 wiki has a section on compatibility.

Since the wiki is awkward for discussions, I am continuing the  
discussion here.
I or someone will update the wiki when agreements are reached.

Here is the current list of compatibility requirements on the Hadoop  
1.0 Wiki for the convenience of this email thread.
What does Hadoop 1.0 mean?
     * Standard release numbering: Only bug fixes in 1.x.y releases  
and new features in 1.x.0 releases.
     * No need for client recompilation when upgrading from 1.x to  
1.y, where x <= y
           o  Can't remove deprecated classes or methods until 2.0
     * Old 1.x clients can connect to new 1.y servers, where x <= y
     * New FileSystem clients must be able to call old methods when  
talking to old servers. This generally will be done by having old  
methods continue to use old rpc methods. However, it is legal to have  
new implementations of old methods call new rpc methods, as long as  
the library transparently handles the fallback case for old servers.
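
The fallback rule in the last bullet can be sketched roughly as follows. This is a minimal Python illustration, not Hadoop code: OldServer, NewServer, FileSystemClient, UnknownMethodError and the rpc names get_status and get_status_v2 are all hypothetical names made up for the example.

```python
class UnknownMethodError(Exception):
    """Raised by a (hypothetical) server that does not know an rpc method."""


class OldServer:
    """Stand-in for an old server: only the old rpc exists."""

    def call(self, method, path):
        if method == "get_status":
            return {"path": path, "via": "get_status"}
        raise UnknownMethodError(method)


class NewServer(OldServer):
    """Stand-in for a new server: supports the old rpc and a new one."""

    def call(self, method, path):
        if method == "get_status_v2":
            return {"path": path, "via": "get_status_v2"}
        return super().call(method, path)


class FileSystemClient:
    """New client whose old public method prefers the new rpc."""

    def __init__(self, server):
        self.server = server

    def get_status(self, path):
        try:
            # New implementation of the old method: try the new rpc first.
            return self.server.call("get_status_v2", path)
        except UnknownMethodError:
            # Old server: fall back transparently to the old rpc.
            return self.server.call("get_status", path)


print(FileSystemClient(NewServer()).get_status("/a")["via"])
print(FileSystemClient(OldServer()).get_status("/a")["via"])
```

The caller always uses the same old method; only the library knows which rpc was actually sent.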

A couple of additional compatibility requirements:

* HDFS metadata and data are preserved across release changes, both  
major and minor. That is, whenever a release is upgraded, the HDFS  
metadata from the old release will be converted automatically as needed.

The above has been followed so far in Hadoop; I am just documenting it  
in the 1.0 requirements list.

   * In a major release transition [i.e., from a release x.y to a  
release (x+1).0], a user should be able to read data from the cluster  
running the old version.  (Or shall we generalize this to: from x.y to  
(x+i).z?)
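
To make the version rules concrete, here is a small Python sketch of two of them as predicates. It is only one reading of the text: the Version type and both function names are hypothetical, the connect rule is assumed to apply within any single major version (the list only states it for 1.x), and the "generalized" flag models the open question about (x+i).z.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int


def client_can_connect(client: Version, server: Version) -> bool:
    """Old x.a client, new x.b server, a <= b (assumed: same major only)."""
    return client.major == server.major and client.minor <= server.minor


def can_read_old_cluster(reader: Version, old: Version,
                         generalized: bool = False) -> bool:
    """Can a user on release `reader` read data from a cluster on `old`?

    Strict form: old is x.y and reader is exactly (x+1).0.
    Generalized form: reader is (x+i).z for any i >= 1 and any z.
    """
    if generalized:
        return reader.major > old.major
    return reader.major == old.major + 1 and reader.minor == 0


print(client_can_connect(Version(1, 0), Version(1, 2)))
print(can_read_old_cluster(Version(3, 1), Version(1, 4), generalized=True))
```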

The motivation: data copying across clusters is a common operation for  
many customers (for example, this is routinely done at Yahoo). Today,  
http (or hftp) provides a guaranteed compatible way of copying data  
across versions.  Clearly one cannot force a customer to simultaneously  
upgrade all of its Hadoop clusters to a new major release. The above  
documents this requirement; we can satisfy it via the http/hftp  
mechanism or some other mechanism.

Question: is one willing to break applications that operate across  
clusters (i.e., an application that accesses data across clusters that  
cross a major release boundary)? I asked the operations team at Yahoo  
that runs our Hadoop clusters. We currently do not have any applications  
that access data across clusters as part of an MR job. The reason is  
that Hadoop routinely breaks wire compatibility across releases,  
so such apps would be very unreliable. However, the copying of  
data across clusters is crucial and needs to be supported.

Shall we add a stronger requirement for 1.0: wire compatibility  
across major versions? This can be supported by class loading or other  
games. Note that we can wait to provide this until 2.0 happens. If Hadoop  
provided this guarantee, it would allow customers to partition  
their data across clusters without risking apps breaking across major  
releases due to wire incompatibility issues.
