hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Mon, 16 May 2011 14:11:18 GMT
On 16/05/11 13:00, Segel, Mike wrote:
> But Cloudera's release is a bit murky.
>
> The math example is a bit flawed...
>
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
>
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> Then does it not have the right to be considered Apache Hadoop?
> It's my understanding that any enhancement to Hadoop is made available to Apache and
> will eventually make it into a later release...

It certainly contains it.

Now, if you want to make life more complex:
-view the contributions to the code base as a series of patches P1...Pn, 
each of which changes the code.
-These patches are essentially functions that transform the source S to 
a new state S'.
-the initial state of the source codebase is S0.

Hypothesis: the order in which the patch functions are applied 
determines the final state of the source tree.

If patches P1 and P2 were applied in order, you would get a state

S' = P2(P1(S0))

Applying the patches in a different order, you get a new final state.
S'' = P1(P2(S0))


The question for the maths people, then, is: can you be sure that S' and S'' 
are the same? It seems to me that it depends on the nature of 
the functions. It could be that the set of functions that SVN supports 
guarantees sameness, but given the conflict resolution problems I've 
encountered in the past, I doubt this.
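
To make the hypothesis concrete, here is a toy sketch. It is an assumption-laden model, not how SVN actually works: the source tree is a dict of filename to contents, and p1 and p2 are hypothetical patches written as plain functions. In real SVN the second patch would more likely fail with a conflict than apply silently, but the point about ordering is the same.

```python
# Toy model (an assumption, not SVN's real behaviour): a source tree is a
# dict of filename -> contents, and a patch is a function tree -> new tree.

def p1(tree):
    """Hypothetical patch P1: rename the key 'timeout' to 'timeout_ms'."""
    new = dict(tree)
    new["conf.txt"] = new["conf.txt"].replace("timeout=", "timeout_ms=")
    return new

def p2(tree):
    """Hypothetical patch P2: change 'timeout=30' to 'timeout=60'."""
    new = dict(tree)
    new["conf.txt"] = new["conf.txt"].replace("timeout=30", "timeout=60")
    return new

s0 = {"conf.txt": "timeout=30\n"}

s_prime = p2(p1(s0))         # S'  = P2(P1(S0)): P2 no longer finds its target
s_double_prime = p1(p2(s0))  # S'' = P1(P2(S0)): both changes take effect

print(s_prime == s_double_prime)
```

Here the two orderings leave different values in conf.txt, so in this model at least, the patch functions do not commute.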

Assuming that my belief holds, that the order in which a series of SVN 
patches is applied determines the final state of the source tree, then 
saying the patch sets (the sets of functions applied to the source) of 
two codebases are equivalent does not mean the final state of the code 
is the same, unless the sequence of application is also the same.

That would then define an Apache release as a strictly ordered sequence 
of patches, or at least a sequence of operations that leads to the same 
final code state, such as S0.20.3.

(oh look, I've just written a formal definition of what a release is, 
though I've avoided defining what a function is. View them as planar 
projections in cartesian space or something)
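
A rough sketch of that definition, under the same toy assumptions as before (a tree is a dict, a patch is a function; apply_all, fingerprint and append_to are hypothetical helpers, not part of any Hadoop or SVN tooling). Two patch sequences count as the same release if they lead to the same final state, which is compared here by hashing the tree.

```python
import hashlib
from functools import reduce

def apply_all(patches, tree):
    """Apply an ordered sequence of patch functions to a starting tree."""
    return reduce(lambda t, p: p(t), patches, tree)

def fingerprint(tree):
    """Hash the whole tree so two final states can be compared cheaply."""
    h = hashlib.sha256()
    for name in sorted(tree):
        h.update(name.encode())
        h.update(b"\0")
        h.update(tree[name].encode())
        h.update(b"\0")
    return h.hexdigest()

def append_to(name, text):
    """Build a hypothetical patch that appends text to one file."""
    def patch(tree):
        new = dict(tree)
        new[name] = new.get(name, "") + text
        return new
    return patch

s0 = {"a.txt": "a\n", "b.txt": "b\n"}
pa = append_to("a.txt", "x\n")
pb = append_to("b.txt", "y\n")
pc = append_to("a.txt", "z\n")

# Patches touching different files commute: same final state either way.
same = fingerprint(apply_all([pa, pb], s0)) == fingerprint(apply_all([pb, pa], s0))

# Patches touching the same file do not: the order changes the result.
differ = fingerprint(apply_all([pa, pc], s0)) != fingerprint(apply_all([pc, pa], s0))

print(same, differ)
```

So equal patch sets are not enough on their own; what you would actually compare is the final state, or the exact sequence that produced it.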


>
> So while it may not be 'official' release X(z), all of its components are in Apache.
> (note: I'm talking about the core components and not Cloudera's additional toolsets that
> encompass Hadoop.)
>
> Cloudera is clearly a derivative work.
> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.

Once you start thinking about the ordering of the patch functions it 
gets complicated.

> That doesn't mean that others can't, depending on how they implemented their changes.

Yes, though again it depends on the sequence of functions applied to the 
released source code, such as S0.20.3, to arrive at the version they ship.

> So it wouldn't be a superset since it doesn't contain a complete subset, but contains
> code that implements the API... So they can't say 'Includes Apache Hadoop', but they can say
> it's a derivative work based on Apache Hadoop and then go on to show how and why, in their
> opinion their product is better. (that's marketing for you...)

I agree

> Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the table...

Clearly, but there are still some questions we can resolve here
  -what do they call their products?
  -how can they support assertions that their code is compatible if the 
series of patches they have applied to the codebase is not externally 
visible?
  -what are the concerns of the community about naming and branching?


>
> But because Apache's licensing is so open, Apache will have a hard time controlling derivative
> works...

The Apache license permits anyone to fork and take that fork in house or 
closed source. Doing this is generally considered daft except for 
quick fixes, because whoever maintains the closed fork takes on the task 
of writing all the functions needed to transform it from the released 
state to one that matches customer needs (i.e. the working state).


> I believe that Steve is incorrect in his assertion concerning potential loss of any patent
> protection. Again Apache's licensing is very open and as long as they follow Apache's Ts and
> Cs, they are covered.

Possibly. I avoid such legal issues.

-steve
