incubator-crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: Moving PType and friends
Date Mon, 29 Oct 2012 07:24:39 GMT

On Sun, Oct 28, 2012 at 2:39 AM, Matthias Friedrich <> wrote:

> Hi,
> On Saturday, 2012-10-27, Josh Wills wrote:
> > On Sat, Oct 27, 2012 at 2:41 PM, Matthias Friedrich <>
> wrote:
> [...]
> >> I'd like to avoid a CRUNCH-60 type of situation where I spend lots of
> >> time playing through scenarios and planning changes that, as it turns
> >> out, aren't welcome. That is *very* frustrating, to say the least, so
> >> let's talk ;-)
> > Yeah, that sucked, let's not do that again.
> > I think my main concern is that I don't have visibility into the scope
> > of API changes that come along with CRUNCH-60, so I don't really see
> > what we're building towards. I want to know what the end state of the
> > API looks like w/some amount of precision-- what will live where,
> > etc., etc.
> As far as I'm concerned, CRUNCH-60 is dead because we didn't want to
> create any more Maven modules, which is fine by me. My goal is to
> provide a clean, stable API to clients which means we have to decide
> which packages are client-facing (and thus, everything in there has to
> remain stable) and which packages are internal to Crunch and may be
> changed at any time (giving us the opportunity to improve things).
> With "stable" I mean that we can still add and remove stuff, but only
> in a way that doesn't break compatibility. We can only remove things
> that have been deprecated for at least two releases. What we can't do
> anymore is moving types around.
> Re. your question on precision: I don't believe in completely freezing
> the API. API design is hard and we don't have Josh Bloch on the team,
> so it's unlikely that we get everything right. But I do believe that
> we can adhere to the defintion of stable above.
> > I agree w/Gabriel that having PType and PTypeFamily in the base
> > package feels right and that having stuff that is essentially
> > implementation detail stuff like Converter and OutputHandler there
> > feels wrong. It may be that the value of having PType and PTypeFamily
> > in base outweighs the cost of having Converter and OutputHandler there
> > as well, but since I don't see the big picture here, it feels like
> > we're making a series of locally optimal decisions that don't
> > necessarily lead to a globally optimal API, where I'm using "optimal"
> > in the sense of one that is both cleanly separated as well as easy to
> > understand and use. Making a series of locally optimal decisions is
> > how I ended up coding into the cul-de-sac that necessitated the
> > OutputHandler abstraction in the first place.
> OK, I see the problem. As usual we want the same thing, mostly ;-)
> > If it's possible to spell out exactly what the APIs look like when
> > this is done, let's do that; if we're not there yet but could spell
> > out a set of principles that we would like to be true of the APIs,
> > then let's do that and then start creating some strawman proposals for
> > organization of code that satisfies them as closely as we can.
> > Implementing the changes required to get to the goals may take longer
> > than a single release cycle, but I wouldn't want it to take more than
> > two releases, just because I'm loathe to be in a process of
> > continually moving things around.
> Good idea, let's first agree on a set of principles. In my opinion,
> we should limit the scope for these prinicples to client-facing
> packages, everything else can be changed in any way at any time.
> My proposal is based on [2], a very short and incomplete summary can
> be found at [3]. For us, it boils down to this:
>  * A package must have a clear purpose; it contains either mostly
>    abstractions or mostly implementations (this makes it easier
>    to explain)
>  * A package must not depend on a package that is less stable
>    than itself (meaning a package containing mostly abstractions
>    must not depend on one containing mostly implementations)
>  * There must be no dependencies from a client-facing package to
>    an internal package (that is, javadocs don't have dangling
>    references)
>  * There must be tight cohesion between classes in a package or
>    the package should be split (this doesn't apply for .util)
>  * There must be no dependency cycles between client-facing packages

I agree with these principles, although I think that the first one (clear
purpose for a package) is often in conflict with the last one (dependency
cycles between client facing packages). Is there an implicit priority
scheme here? We're saying that having clear purpose for a package is more
important than having dependency cycles, or are we saying that the two are

> You can calculate metrics for all of this but it's really just common
> sense. Crunch follows these rules in the vast majority of cases
> already. Right now I see the following violations:
>  * The .types package mixes abstractions and implementations and
>    is part of a dependency cycle with base.
>  * The base package references the .io implementation package
>    causing a dependency cycle.
>  * The base package references the .util package causing a
>    dependency cycle.
>  * There are lots of implementations in CombineFn and other Fns
>    that shouldn't be in base (which is for abstractions). We should
>    move them to .fn, perhaps to Guava style CombineFns, FilterFns.
>    We can even do this in a backwards compatible way.

So of these, I think that the CombineFn -> CombineFns change is the easiest
fix, in that it solves the implementation issue for CombineFn and the
dependency of the base API on the util package. I am 100% behind that one.

Sorting out the cycle between io/types/base seems trickier to me and I
think that is the core of the design problem, and it goes right into the
tradeoff between clear purpose for a package and the dependency cycles
between client facing packages. Do you agree?

> Given that Crunch is a framework this is not really ambitious. My
> teams at work follow a lot more design rules even for applications.
> One example: We make everything final by default; extension takes
> place either by composition or by template method which makes it
> possible to evolve classes without breaking client code. That's
> what the DesignForExtension Checkstyle rule mandates (we disabled
> it for Crunch).
> Note: Java 8 will be a game changer; as far as I understand it, our
> abstract Fns are *not* going to work with Lambdas, we would need
> interfaces with only a single method. Does anyone think we have to
> address this before Java 8 is released?
> > I think that once we have a proposal we agree on, we should cut the
> > 0.4.0 release and then devote the next release cycle to implementing
> > the new API for 0.5.0.
> Agreed. BTW, I have verified our release process (with a local Nexus
> and all), we're ready to release at any time.
> So, what do you think of the principles outlined above?
> Regards,
>   Matthias
> [1]
> [2] Robert C. Martin, "Agile Principles, Patterns, and Practices",
>     chapter "Principles of Package and Component Design".
> [3]

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message