incubator-crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Focus of our next release?
Date Sun, 16 Sep 2012 19:48:11 GMT
On Sun, Sep 16, 2012 at 9:42 AM, Matthias Friedrich <matt@mafr.de> wrote:
> Hi,
>
> On Saturday, 2012-09-15, Josh Wills wrote:
>> On Sat, Sep 15, 2012 at 2:55 AM, Matthias Friedrich <matt@mafr.de> wrote:
> [...]
>
>> I worry about the system becoming overly modular/abstracted. For example,
>> YARN took me awhile to figure out when I was writing Kitten, in no small
>> part b/c there are so many modules to go through before I could figure out
>> how everything hung together. I think that having a ton of different
>> modules to wade through in search of understanding is a barrier to
>> adoption-- at least, to adoption by people like me who like to poke at
>> stuff. I'd want to have some discussion around how deep the rabbit hole
>> goes here.
>
> I see what you mean, and I think there has been a misunderstanding. To
> me, modularization doesn't mean we have to put everything into a
> separate Maven module. Given Java's current limitations that's what is
> often done, but this is just one way of doing it. I know Maven
> multi-module projects are cumbersome to work with (hell, I don't even
> *like* Maven). I would use a Maven module when there are different
> sets of dependencies (like with HBase) or when I need to create
> separate artifacts.
>
> My primary concern is to separate interfaces/abstractions and
> implementation on a package level. This way we can easily exclude
> implementation stuff from the user Javadocs, limiting the conceptional
> surface of Crunch significantly (see [1] for how big the system looks
> from a user perspective). Of course, package-level dependencies
> shouldn't contain cycles, which in most cases is done by making sure
> that abstractions don't depend on implementations. Most people do
> this by instinct, in Crunch it's correct in the vast majority of
> cases. There are just a few misplaced classes and a few times
> implementations and abstractions are mixed inside a single class. This
> makes the dependency graph look like a mess (and Javadoc links would
> point to nirvana), but it's all fixable.


This sounds good to me, but I now think there may indeed have been a
misunderstanding here, as I also had the impression that the goal was
to split a lot of things out into different modules. Vinod, can you
confirm that your idea on this is in line with what Matthias is
talking about here?

I can definitely see the point being made about the huge number of
packages exposed in the user Javadoc, while the number of public
packages is much smaller.

In general (and based on my experience), the way to make it easy to
get started with Crunch is to have a few clear examples on the
website/wiki, as these are the usual starting point for most developers.

However, once you get past the stage of making things work, I found
that it was really easy to miss existing functionality (and
reimplement it myself) because the API docs do indeed contain way too
much stuff. Where I'm going with this is that I don't think the
current situation gets in the way of getting started with Crunch, but
it is detrimental to being really efficient with Crunch once you get
past the "getting started" phase.

Back on the topic of splitting things into modules (which appears not
really to be the focus now, if I understand correctly), I have had
experience with projects that went very far with this (GeoTools [1] is
a good example), and I found that it made it *really* difficult to get
started with those projects and definitely scared a lot of people
away from them.

>
>> For example, say we added streaming data support, so that we could have
>> pipelines that operated on streams as well as batch input data. Clearly,
>> this will necessitate some API changes to DoFns in order to support things
>> that only make sense in a streaming context, and it's unlikely that there
>> would be any overlap between the lib/* and impl/* functionality that would
>> be applicable to streaming and batch contexts. So would we end up with:
> [...]
>
> Hmm, maybe we should discuss the streaming stuff on a separate thread.
> I'm not sure what you want to achieve (real time stream processing,
> CEP, ...?) or if it really makes sense to implement this as part of
> Crunch at all, but the number of Maven modules looks excessive :)

I'm also not too sure about the streaming support, and I also like the
idea of a separate thread. Sounds very interesting, but it also sounds
like it's going outside the scope of Crunch (or at least the scope
that I see for Crunch).

- Gabriel


[1] http://geotools.org
