incubator-crunch-dev mailing list archives

From Matthias Friedrich
Subject Re: Focus of our next release?
Date Sun, 16 Sep 2012 07:42:57 GMT

On Saturday, 2012-09-15, Josh Wills wrote:
> On Sat, Sep 15, 2012 at 2:55 AM, Matthias Friedrich wrote:
>> I would see c) as part of a larger mission for improving documentation
>> and usability. An immediate benefit would be that we don't have to
>> provide javadoc for each and every class, only for those packages that
>> are client-facing. Higher perceived quality with less work for us.
> See, it's all about framing the change in terms of something I want. I
> don't want to write javadoc, hence, I am very much in favor of changes that
> don't require me to write it. ;-)
> All kidding aside, I think that javadoc for the internal stuff is important
> too, if only to attract new committers who want to hack on the guts of the
> system as well as use the client APIs.
For internal stuff I'm happy when each class has a short mission
statement. In my opinion, *Published* APIs should be fully documented,
down to the method level. But there should be relatively few published
> So I think that having no dependencies between lib/* and impl/* that don't
> go through crunch-api is a positive thing. I think that if you hear
> hesitation in my voice, it's because I am expedient to a fault, and am
> often happy to circumvent clean lines of demarcation in the interest of
> getting something working *right now,* and am contemplating how guilty I
> would feel doing that in a world where these pieces are broken across
> separate modules. My happiest days are the ones where I end up adding new
> functionality or generalizing some abstraction in a new way, and I only
> really think about refactoring and modularization when I hit a point where
> the poor quality of the code is blocking me from adding something I really
> want (viz., the refactoring of MSCRPlanner as a precursor to adding MSCR
> fusion.)

Great, that is really helpful information :)

You probably already suspect that my philosophy is a little different.
A big part of my daily work as a software architect is making sure
that software stays maintainable in the long run and I use each and
every tool from my software engineering toolbox to do this. I know
from experience how difficult and time-consuming it is to get a
derailed project back on track, so I'd like to prevent this from
happening with Crunch while it's still small and relatively easy to

I think to you, the current code base is a basis for exciting new
things to come -- you've outlined a few. My approach would have been
to get what we have into shape so that it's easy and pleasant to use.
Vinod mentioned a few of the problems he faced and I recognized them
from my first tries with Crunch. It would be great if we could solve
these problems without locking down the code base in a way that makes
evolving Crunch impossible.

> I worry about the system becoming overly modular/abstracted. For example,
> YARN took me a while to figure out when I was writing Kitten, in no small
> part b/c there are so many modules to go through before I could figure out
> how everything hung together. I think that having a ton of different
> modules to wade through in search of understanding is a barrier to
> adoption-- at least, to adoption by people like me who like to poke at
> stuff. I'd want to have some discussion around how deep the rabbit hole
> goes here.

I see what you mean, and I think there has been a misunderstanding. To
me, modularization doesn't mean we have to put everything into a
separate Maven module. Given Java's current limitations that's what is
often done, but this is just one way of doing it. I know Maven
multi-module projects are cumbersome to work with (hell, I don't even
*like* Maven). I would use a Maven module when there are different
sets of dependencies (like with HBase) or when I need to create
separate artifacts.

My primary concern is to separate interfaces/abstractions and
implementation on a package level. This way we can easily exclude
implementation stuff from the user Javadocs, limiting the conceptual
surface of Crunch significantly (see [1] for how big the system looks
from a user perspective). Of course, package-level dependencies
shouldn't contain cycles, which in most cases is achieved by making
sure that abstractions don't depend on implementations. Most people do
this by instinct; in Crunch it's correct in the vast majority of
cases. There are just a few misplaced classes and a few times
implementations and abstractions are mixed inside a single class. This
makes the dependency graph look like a mess (and Javadoc links would
point nowhere), but it's all fixable.
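
To make the "no cycles between packages" rule concrete, here is a minimal
sketch in Java. It is not taken from the Crunch code base: the class name
PackageCycleCheck and the package labels "api", "impl" and "lib" are
hypothetical, and the dependency graph is written out by hand rather than
extracted from real sources. It only illustrates that a graph in which
implementations depend on abstractions is acyclic, and that one misplaced
dependency from an abstraction to an implementation creates a cycle.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: detects cycles in a hand-written package dependency graph. */
final class PackageCycleCheck {

    /** Returns true if following "depends on" edges leads back to a package on the current path. */
    static boolean hasCycle(Map<String, Set<String>> deps) {
        Set<String> done = new HashSet<>();    // packages fully explored, known to be cycle-free
        Set<String> onPath = new HashSet<>();  // packages on the current depth-first path
        for (String pkg : deps.keySet()) {
            if (visit(pkg, deps, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String pkg, Map<String, Set<String>> deps,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(pkg)) {
            return true;   // reached a package that is already on the path: cycle
        }
        if (done.contains(pkg)) {
            return false;  // already explored from here without finding a cycle
        }
        onPath.add(pkg);
        // Packages that never appear as a key are treated as having no dependencies.
        for (String dep : deps.getOrDefault(pkg, Set.of())) {
            if (visit(dep, deps, done, onPath)) {
                return true;
            }
        }
        onPath.remove(pkg);
        done.add(pkg);
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical layout: implementation and library code depend only on the abstractions.
        Map<String, Set<String>> clean = Map.of(
                "api", Set.of(),
                "impl", Set.of("api"),
                "lib", Set.of("api"));
        // Same layout with one misplaced class: an abstraction now depends on an implementation.
        Map<String, Set<String>> tangled = Map.of(
                "api", Set.of("impl"),
                "impl", Set.of("api"),
                "lib", Set.of("api"));
        System.out.println("clean has cycle: " + hasCycle(clean));
        System.out.println("tangled has cycle: " + hasCycle(tangled));
    }
}
```

In practice the graph would come from a dependency analysis tool rather
than a literal map; the check itself stays the same.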

> For example, say we added streaming data support, so that we could have
> pipelines that operated on streams as well as batch input data. Clearly,
> this will necessitate some API changes to DoFns in order to support things
> that only make sense in a streaming context, and it's unlikely that there
> would be any overlap between the lib/* and impl/* functionality that would
> be applicable to streaming and batch contexts. So would we end up with:

Hmm, maybe we should discuss the streaming stuff on a separate thread.
I'm not sure what you want to achieve (real-time stream processing,
CEP, ...?) or if it really makes sense to implement this as part of
Crunch at all, but the number of Maven modules looks excessive :)
