incubator-crunch-dev mailing list archives

From Gabriel Reid <>
Subject Re: Focus of our next release?
Date Sun, 16 Sep 2012 19:30:22 GMT
On Fri, Sep 14, 2012 at 7:23 PM, Matthias Friedrich <> wrote:
> should we discuss the focus of our next release? Maybe make a list
> of things we want to achieve? Or would this be too much process?

Sorry for being so slow on the reply to this, I'm in a bit of a
vacation mode with spotty internet coverage. I'll just post my top
priorities here before jumping into the other discussions.

For me, the focus in the short to mid term for Crunch should be
to improve usability and stability -- looking at the big picture, I'd
say that a kind of mission statement for Crunch could be to eliminate
the need to ever write another mapper or reducer again, while
being the de facto way to work with Hadoop MapReduce. From my own
perspective, this goal is already largely achieved, but I think that
convincing a larger population of developers will take a bit more

The main steps that I see for this are:

1. Make it easier to get started with Crunch. This has already been
discussed further down in this thread, and there are different ways to
achieve it, but I think that the common theme is to improve
documentation and API clarity/simplicity.

2. Get rid of the "big" bugs. This is an obvious one, and I'm not sure
of how many there are still lurking in Crunch now, but the incorrect
total-order sorting and object-reuse bugs that have been dealt with
recently are the kinds of "big" bugs that I'm mostly worried about, as
they have major implications and are easy to get burnt by without
noticing it.

3. Make Crunch more pluggable, and therefore easier to migrate to --
as in, make it possible to plug in existing Mapper and Reducer
implementations, as well as making it easier to plug in existing
InputFormats and OutputFormats.

4. Add some handy little things like more clever input and output file
handling -- for example, allow giving a glob pattern and a directory
as input, with Crunch finding the correct input files by recursively
searching the input directory. Another example of this is better
handling for output file names (to abstract away the default Hadoop

My reasoning behind all of this is that the easier Crunch is to use,
the more it will get used, with all the good things that come with
that being the natural result.

I'm not sure if these steps comprise a real focus for a next release,
or fit more into the category of miscellaneous new additions.

- Gabriel
