crunch-dev mailing list archives

From Vinod Kumar Vavilapalli
Subject Re: Focus of our next release?
Date Sun, 16 Sep 2012 02:31:42 GMT

Great thread, the kind I wanted to start.

+1 for some sort of release plan. We can do either time-based plans or feature-based plans,
but we should pick one instead of doing both in parallel.

+1 for refactoring done in small steps. While I can understand how it can affect ongoing progress,
I strongly believe that not doing it will hurt us even more.

> I worry about the system becoming overly modular/abstracted. For example,
> YARN took me awhile to figure out when I was writing Kitten, in no small
> part b/c there are so many modules to go through before I could figure out
> how everything hung together. I think that having a ton of different
> modules to wade through in search of understanding is a barrier to
> adoption-- at least, to adoption by people like me who like to poke at
> stuff. I'd want to have some discussion around how deep the rabbit hole
> goes here.

Let me harp on this for a while, especially given that I am responsible for the source structure
there :) I completely understand your pain; I've heard the same from others too. But I argue that
the solution isn't to have a monolithic code base.

The reason I started the original discussion of the split is this: I wanted to see how I
could start writing my own Crunch example, from the POV of a Crunch user. I started looking
for the APIs and it turned out to be difficult, with API and implementation all woven together,
just like Hadoop MapReduce if you ask me. Sure, you could write more docs (which is a welcome
effort BTW), but the source structure itself should give immediate feedback about what is part
of the API and what isn't, which methods are meant for users and which are impl details that can
change anytime, and which API is really public and which isn't (arguably this is a Java limitation
in how package and non-package visibility is defined, but we are stuck with it).
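
A minimal sketch of the kind of separation I mean. All names here (Transform, Transforms,
LengthTransform) are made up for illustration and are not actual Crunch classes. Users program
against an interface and a factory, and the implementation class stays hidden. In a single file
a private nested class is enough to hide it. Across packages the implementation class often has
to be public so that other packages of the same project can use it, which is where separate api
and impl modules help mark the boundary.

```java
// Hypothetical names throughout; this is an illustration, not Crunch code.
final class ApiImplSplitSketch {

    // "API" side: the only type a user is meant to program against.
    public interface Transform<S, T> {
        T apply(S input);
    }

    // "API" side: a factory is the single entry point for obtaining transforms.
    public static final class Transforms {
        private Transforms() {}

        public static Transform<String, Integer> length() {
            return new LengthTransform();
        }
    }

    // "impl" side: private, so it can change without breaking user code.
    private static final class LengthTransform implements Transform<String, Integer> {
        @Override
        public Integer apply(String input) {
            return input.length();
        }
    }

    public static void main(String[] args) {
        // User code only ever names the interface and the factory.
        Transform<String, Integer> t = Transforms.length();
        System.out.println(t.apply("crunch"));
    }
}
```

With this shape, anything a user can name is API by construction, and everything else is free
to change.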

That said, I agree that we need to hit a sweet spot here: just enough modularity to separate
APIs, libs and impl, so that things are easy for users and each part can evolve cleanly, but
not so much that it becomes intractable for developers.

> For example, say we added streaming data support, so that we could have
> pipelines that operated on streams as well as batch input data. Clearly,
> this will necessitate some API changes to DoFns in order to support things
> that only make sense in a streaming context, and it's unlikely that there
> would be any overlap between the lib/* and impl/* functionality that would
> be applicable to streaming and batch contexts. So would we end up with:
> crunch-core-api (shared between batch and stream, e.g., DoFns, MapFns, etc.)
> crunch-batch-api (PCollection and PTable and friends)
> crunch-stream-api (PStream, etc.)
> crunch-batch-impl
> crunch-batch-lib
> crunch-stream-impl
> crunch-stream-lib
> ? And if so, do we want to rename the modules over time to reflect their
> new, more-specific functionality? We go towards crunch-hbase-batch and
> crunch-hbase-streaming and crunch-solr-batch and crunch-solr-streaming, or
> do we have top-level core, batch, and streaming modules w/the
> extension-specific submodules underneath them?

I don't know much about this, but I thought significant parts of the current Crunch code base are
batch oriented: the APIs, the plan optimizations etc. Do you think the APIs would remain the same
if we wanted to do something streaming oriented?

Irrespective of that, you could have a simpler organization:
 - a top-level crunch-batch and crunch-stream
 - with crunch-*/lib and crunch-*/impl underneath each.

