incubator-crunch-dev mailing list archives

From Josh Wills
Subject Re: Focus of our next release?
Date Sun, 16 Sep 2012 01:21:53 GMT

On Sat, Sep 15, 2012 at 2:55 AM, Matthias Friedrich wrote:

> Hi,
> On Friday, 2012-09-14, Josh Wills wrote:
> > I like the idea of having themes for releases. In my head, the theme of
> > this release could be either
> >
> > a) Hacking the new MSCRPlanner code, esp. to add the ability to fuse
> > different MSCR jobs into a single instance that it enables, or
> > b) data access/integration points-- things like solr, hcatalog, hbase,
> > cassandra, jdbc, etc. as input and output sources for Crunch pipelines, or
> > c) API refactoring-- the crunch-api/crunch-impl/crunch-lib split, or
> I would see c) as part of a larger mission for improving documentation
> and usability. An immediate benefit would be that we don't have to
> provide javadoc for each and every class, only for those packages that
> are client-facing. Higher perceived quality with less work for us.

See, it's all about framing the change in terms of something I want. I
don't want to write javadoc, hence, I am very much in favor of changes that
don't require me to write it. ;-)

All kidding aside, I think that javadoc for the internal stuff is important
too, if only to attract new committers who want to hack on the guts of the
system as well as use the client APIs.

> I wouldn't make it a separate release though, perhaps we can do this
> in a series of smaller steps, starting with the crunch-api split.
> Refactorings like this usually turn into long frustrating monster
> tasks that prevent other progress. I'd really like to avoid that.
> But, before spending any more time on this, I think we should all
> agree that this is what we really want. Somehow I got the impression
> that you're not fully convinced that this refactoring is necessary or
> even a good idea. To me it would feel like people trying to rearrange
> the furniture in my living room. Let's discuss this here before we
> produce any more patches.

So I think that having no dependencies between lib/* and impl/* that don't
go through crunch-api is a positive thing. I think that if you hear
hesitation in my voice, it's because I am expedient to a fault, and am
often happy to circumvent clean lines of demarcation in the interest of
getting something working *right now,* and am contemplating how guilty I
would feel doing that in a world where these pieces are broken across
separate modules. My happiest days are the ones where I end up adding new
functionality or generalizing some abstraction in a new way, and I only
really think about refactoring and modularization when I hit a point where
the poor quality of the code is blocking me from adding something I really
want (viz., the refactoring of MSCRPlanner as a precursor to adding MSCR

I worry about the system becoming overly modular/abstracted. For example,
YARN took me a while to figure out when I was writing Kitten, in no small
part b/c there are so many modules to go through before I could figure out
how everything hung together. I think that having a ton of different
modules to wade through in search of understanding is a barrier to
adoption-- at least, to adoption by people like me who like to poke at
stuff. I'd want to have some discussion around how deep the rabbit hole
goes here.

For example, say we added streaming data support, so that we could have
pipelines that operated on streams as well as batch input data. Clearly,
this will necessitate some API changes to DoFns in order to support things
that only make sense in a streaming context, and it's unlikely that there
would be any overlap between the lib/* and impl/* functionality that would
be applicable to streaming and batch contexts. So would we end up with:

crunch-core-api (shared between batch and stream, e.g., DoFns, MapFns, etc.)
crunch-batch-api (PCollection and PTable and friends)
crunch-stream-api (PStream, etc.)

? And if so, do we want to rename the modules over time to reflect their
new, more-specific functionality? Do we go towards crunch-hbase-batch and
crunch-hbase-streaming and crunch-solr-batch and crunch-solr-streaming, or
do we have top-level core, batch, and streaming modules w/the
extension-specific submodules underneath them?
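
To make the hypothetical core/batch/stream split a bit more concrete, here is
a minimal Java sketch. It is not the Crunch API: SimpleDoFn, runBatch, and
StreamRunner are made-up stand-ins, and the module names in the comments are
only the hypothetical ones listed above. The point is just that one function
type could be shared while batch and stream execution live in separate places.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ModuleSplitSketch {

    // Would live in the hypothetical crunch-core-api module.
    // Hypothetical stand-in for a DoFn: one input, zero or more outputs.
    public interface SimpleDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
    }

    // Would live in the hypothetical crunch-batch-api module.
    // Applies the function to a finite, fully materialized input.
    public static <S, T> List<T> runBatch(List<S> input, SimpleDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    // Would live in the hypothetical crunch-stream-api module.
    // Applies the same function one element at a time as elements arrive.
    public static final class StreamRunner<S, T> {
        private final SimpleDoFn<S, T> fn;
        private final Consumer<T> sink;

        public StreamRunner(SimpleDoFn<S, T> fn, Consumer<T> sink) {
            this.fn = fn;
            this.sink = sink;
        }

        public void onElement(S element) {
            fn.process(element, sink);
        }
    }

    // A function written once against the shared core interface.
    public static SimpleDoFn<String, String> splitWords() {
        return (line, emit) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        };
    }

    public static void main(String[] args) {
        List<String> batch = runBatch(Arrays.asList("a b", "c"), splitWords());

        List<String> streamed = new ArrayList<>();
        StreamRunner<String, String> runner =
                new StreamRunner<>(splitWords(), streamed::add);
        runner.onElement("a b");
        runner.onElement("c");

        // Both paths reuse the same function and yield the same elements.
        System.out.println(batch.equals(streamed)); // prints "true"
    }
}
```

In a layout like this, anything that only makes sense for streams would stay
out of the shared interface, which is exactly where the naming and nesting
questions above start to bite.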

I kind of doubt that this sort of hypothesizing is useful, but I'm somewhat
sleep-deprived and this is what is on my mind.

> > d) working on a PStream API that would let people apply DoFns to streams
> > and would build on top of things like WalMart's mupd8 or Storm or whatever.
> >
> > Of course, this is in addition to whatever fixes and new lib functions we
> > want to add over time. I don't want anything heavyweight, but those are
> > some of the larger-scale things that we'll need to tackle as a community,
> > and I would think of completing each of those big things as corresponding
> > to a release.
> Sounds good to me.
> Regards,
>   Matthias

Director of Data Science
Cloudera
Twitter: @josh_wills
