crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Whiting <davidwhit...@gmail.com>
Subject Re: Alternative strategy for incorporating Java 8 lambdas into Crunch
Date Wed, 16 Dec 2015 22:25:22 GMT
So I've done a few more iterations on this and I'm back with even more
comments (I really want to get this right, can you tell? :-P).

The "wrap the whole collection" approach gives you a really usable API, but
with minor inconvenience to the developer (wrapping PCollections into
LCollections before they can start using it). I tried both (a) making
LCollection<S> implement PCollection<S> and delegate _everything_ (wrapping
the relevant return types into LCollections) and (b) Having a cut-back
functional-first API (like Scrunch), which requires digging down to the
underlying PCollection and re-promoting afterwards to do more complex stuff.

The problem with (a) is that you have an unwieldy mega-interface in your
LThings, increasing your maintainence workload and requiring to be updated
with every change to PThings, and just shifting the problem with the
existing IDoFn/IMapFn implementation into the lambda project instead - not
really providing much of an improvement over the current master branch and
increasing code volume significantly.

The problem with (b) is that existing tools and library functions will
require unwrapping and wrapping to use, and the whole notion of
compatibility becomes awkward. We don't have the "luxury" of being able to
promote and demote types from wrapped versions implicitly like we do in
Scala, so it again becomes slightly awkward to use for when stuff gets more
complex. Also, one has to define what the subset of allowable operations
is, and that would end up being somewhat arbitrary. There was one
interesting development to come out of this investigation, and that's the
idea of cutting the size of the interface in half by not having separate
"String name," prefixed versions of every method, and instead making it a
separate method call to set it, ie:

someCollection.id("Only adult users").filter(user -> user.getAge() >= 18)
              .id("Extact username").by(User::getUsername, strings())

But anyway I digress, back to topic: That leaves me us with 3 different
versions of Java lambda support, all of which I find slightly unsatisfying
in their own way. The master branch implementation of IFns leads to
exploding PThings interfaces and the corresponding burden on implementers,
as well as not allowing other Java 8 language things; the
static-import-wrap solution leaves the code looking awkward and makes it
necessary to give lots of options kinda arbitrary names; and the "wrap the
whole interface" solution essentially creates a divergence in the APIs,
reduces compatibility with existing code and/or increases the maintainence
burden.

So, here comes option 4. Bear with me on this one, because it's a bit of an
shift in mindset, inspired in part by what Google did in the time between
the FlumeJava paper and DataFlow on GCP.

The fundamental reason it's difficult to implement this change is that we
can't just "add a few things" to the PCollection interface. If we extend it
via inheritance or composition we must extend and wrap everything so that
we return the extended type, and if we want to add things to the interface
itself, they must be implemented and propagated through all the
implementations. If we could use default methods in the PCollection
interface it would help, but there's a different approach we could take.

If we encapsulate the transformation itself (map, flatMap, filter,...) into
an object, comprising of a function to operate on the PCollection itself,
then everyone can extend this in whatever way they like. In a sense we add
an extension point so that people can add user-defined functions without
breaking the fluency of the interface (in the way that many of the library
functions now require you to).

This Gist probably explains it better than I can in words:
https://gist.github.com/DavW/02d935366bc610c1cbb3

The latter part of it is also very interesting because it solves another
problem. We can potentially (subject to a deprecation cycle) remove many of
methods that have been added to PThings for fluency reasons (like max(),
min(), first(), length(), ...) and just use them via Transforms instead,
keeping only lower-level primitives as part of the PCollection interface
itself, making it easier for people wishing to implement Crunch for future
data technologies (which will be crucial to the long-term survival of this
project).

I'm not sure that this is definitely better than all of the other 3
options, but consider this an RFC on the topic.





On 15 December 2015 at 10:34, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Yeah, looking at the two next to each other I'm going for the
> collections approach as well. +1
>
> On Tue, Dec 15, 2015 at 2:04 AM, Josh Wills <josh.wills@gmail.com> wrote:
> > On Mon, Dec 14, 2015 at 4:15 PM, David Whiting <davidwhiting@gmail.com>
> > wrote:
> >
> >> 1) Not at all, just some leftover working names for stuff.
> >>
> >> 2) Not for a totally minimal implementation, but some of the features I
> >> would like to include would rely on Java 8 things, for example adapting
> the
> >> GroupedTable stuff to use Streams rather than Iterables because of a)
> the
> >> extra expressivity and b) the implied once-only traversal. We could
> have a
> >> filterMap which applies a Function<S, Optional<T>> (my most common
use
> case
> >> for a DoFn instead of a MapFn at the moment). We can also potentially
> >> utilise Collectors for collapsing values in reduce-side stuff and
> finally,
> >> it'll make the implementation of it a fair bit easier. The maven
> overhead
> >> is pretty low, so I guess it's just the existence of an extra artifact
> to
> >> consider. The way I see it is that it's a push to make the API feel more
> >> like Java streams and be more immediately usable by someone who knows
> Java
> >> streams but not necessarily big data, so the more we can replicate that
> >> feel by integrating with other familiar Java 8 features, the better.
> >>
> >
> > Makes sense to me. +1 for a new crunch-lambda module.
> >
> >
> >>
> >> On 15 December 2015 at 00:51, Josh Wills <josh.wills@gmail.com> wrote:
> >>
> >> > I think I lean towards the collections approach, but that's probably
> >> > because of my Scrunch experience. Two questions:
> >> >
> >> > 1) Is mapToTable necessary? I would think map(SFunction, PTableType)
> >> would
> >> > be distinguishable from map(SFunction, PType) by the compiler in the
> same
> >> > way it is for parallelDo.
> >> > 2) Does the collections approach need a separate maven target at all,
> or
> >> > could it just be part of crunch-core as a replacement for the IFn
> stuff?
> >> Or
> >> > is there Java 8-only stuff we'll want to add in to its API?
> >> >
> >> > On Mon, Dec 14, 2015 at 3:13 PM, David Whiting <davw@apache.org>
> wrote:
> >> >
> >> > > Ok, so I've implemented a few iterations of this. I went forward
> with
> >> the
> >> > > "wrap the functions" method, which seemed to work alright, but
> finding
> >> > good
> >> > > names for functions which essentially just wrap functions but which
> >> > aren't
> >> > > ambiguous in erasure and read nicely was a real challenge. I showed
> >> some
> >> > > sample code to some of my fellow data engineers and the consensus
> >> seemed
> >> > to
> >> > > be that it was definitely better than anonymous inner classes, but
> it
> >> > still
> >> > > felt kind of awkward and strange to use.
> >> > >
> >> > > So here's a 3rd option: wrap the collection types rather than the
> >> > function
> >> > > types, and present an API which feels truly Java 8 native whilst
> still
> >> > > being able to dig back to the underlying PCollections (doing pretty
> >> much
> >> > > what Scrunch does, but with less implicit Scala magic).
> >> > >
> >> > > Here's a super-minimal proof-of-concept for that:
> >> > > https://gist.github.com/DavW/7efe484ea0c00cf6e66b
> >> > >
> >> > > and a comparison of the two approaches in usage:
> >> > > https://gist.github.com/DavW/997a92b31d55c5317fb7
> >> > >
> >> > >
> >> > > On 13 December 2015 at 16:14, Gabriel Reid <gabriel.reid@gmail.com>
> >> > wrote:
> >> > >
> >> > > > This looks very cool. As long as we can keep things compatible
> with
> >> > > > Java 7 using whatever kind of maven voodoo that's necessary,
I'm
> all
> >> > > > for it.
> >> > > >
> >> > > > I'd say no real reason to keep the IFn stuff if this goes in.
> >> > > >
> >> > > > - Gabriel
> >> > > >
> >> > > > On Fri, Dec 11, 2015 at 11:18 PM, Josh Wills <
> josh.wills@gmail.com>
> >> > > wrote:
> >> > > > > It seems like a net positive over the IFn stuff, so I could
> make an
> >> > > > > argument for replacing it, but if there's anyone out there
in
> love
> >> > > > w/IFns,
> >> > > > > they should speak up now. :)
> >> > > > >
> >> > > > > J
> >> > > > >
> >> > > > > On Fri, Dec 11, 2015 at 2:17 PM, David Whiting <davw@apache.org
> >
> >> > > wrote:
> >> > > > >
> >> > > > >> I *think* you can set language level and target jdk
on a
> >> per-module
> >> > > > basis,
> >> > > > >> so it should be relatively easy. I'll experiment at
some point
> >> over
> >> > > the
> >> > > > >> weekend. Would this complement or replace the I*Fn stuff
do you
> >> > think?
> >> > > > 14.0
> >> > > > >> is not yet released, so I guess it's not too late to
change if
> we
> >> > want
> >> > > > to.
> >> > > > >>
> >> > > > >> On 11 December 2015 at 22:57, Josh Wills <josh.wills@gmail.com
> >
> >> > > wrote:
> >> > > > >>
> >> > > > >> > That's the sexiest thing I've seen in some time.
+1 for a
> lambda
> >> > > > module,
> >> > > > >> > but how does that work in Maven-fu? Is it like
a conditional
> >> > compile
> >> > > > or
> >> > > > >> > something?
> >> > > > >> >
> >> > > > >> > On Fri, Dec 11, 2015 at 1:20 PM, David Whiting
<
> davw@apache.org
> >> >
> >> > > > wrote:
> >> > > > >> >
> >> > > > >> > > Oops, my bad. Here's a Gist:
> >> > > > >> > > https://gist.github.com/DavW/e2588e42c45ad8c06038
> >> > > > >> > >
> >> > > > >> > > On 11 December 2015 at 18:43, Josh Wills <
> >> josh.wills@gmail.com>
> >> > > > wrote:
> >> > > > >> > >
> >> > > > >> > > > I think it's kind of awesome, but the
attachment didn't
> go
> >> > > > through-
> >> > > > >> PR
> >> > > > >> > or
> >> > > > >> > > > gist?
> >> > > > >> > > > On Fri, Dec 11, 2015 at 7:42 AM David
Whiting <
> >> > davw@apache.org>
> >> > > > >> wrote:
> >> > > > >> > > >
> >> > > > >> > > > > While fixing the bug where the IFn
version of
> mapValues on
> >> > > > >> > > PGroupedTable
> >> > > > >> > > > > was missing, I got thinking that
this is quite an
> >> > inefficient
> >> > > > way
> >> > > > >> of
> >> > > > >> > > > > including support for lambdas and
method references,
> and
> >> it
> >> > > > still
> >> > > > >> > > didn't
> >> > > > >> > > > > actually support quite a few of
the features that would
> >> make
> >> > > it
> >> > > > >> easy
> >> > > > >> > to
> >> > > > >> > > > > code against.
> >> > > > >> > > > >
> >> > > > >> > > > > Negative parts of existing lambda
implementation:
> >> > > > >> > > > > 1) Explosion of already-crowded
PCollection, PTable and
> >> > > > >> PGroupedTable
> >> > > > >> > > > > interfaces, and having to implement
those methods in
> all
> >> > > > >> > > implementations.
> >> > > > >> > > > > 2) Not supporting flatMap to Optional
or Stream types.
> >> > > > >> > > > > 3) Not exposing convenient types
for reduce-type
> >> operations
> >> > > > (Stream
> >> > > > >> > > > > instead of Iterable, for example).
> >> > > > >> > > > >
> >> > > > >> > > > > Something that would solve all three
of these is to
> build
> >> > > lambda
> >> > > > >> > > support
> >> > > > >> > > > > as a separate artifact (so we can
use all java8 types),
> >> and
> >> > > > instead
> >> > > > >> > of
> >> > > > >> > > > the
> >> > > > >> > > > > API being directly on the PSomething
interfaces, we
> just
> >> > have
> >> > > > >> > > convenient
> >> > > > >> > > > > ways to wrap up lambdas into DoFns
or MapFns via
> >> > > > >> statically-imported
> >> > > > >> > > > > methods.
> >> > > > >> > > > >
> >> > > > >> > > > > The usage then becomes
> >> > > > >> > > > > import static org.apache.crunch.Lambda.*;
> >> > > > >> > > > > ...
> >> > > > >> > > > > someCollection.parallelDo(flatMap(d
-> someFnOf(d)),
> pt)
> >> > > > >> > > > > ...
> >> > > > >> > > > > otherGroupedTable.mapValue(reduce(seq
->
> seq.mapToInt(i ->
> >> > > > >> i).sum()),
> >> > > > >> > > > > ints())
> >> > > > >> > > > >
> >> > > > >> > > > > Where flatMap and reduce are static
methods on Lambda,
> and
> >> > > > Lambda
> >> > > > >> > goes
> >> > > > >> > > in
> >> > > > >> > > > > it's own artifact (to preserve compatibility
with 6
> and 7
> >> > for
> >> > > > the
> >> > > > >> > rest
> >> > > > >> > > of
> >> > > > >> > > > > Crunch).
> >> > > > >> > > > > I've attached a basic proof-of-concept
implementation
> >> which
> >> > > I've
> >> > > > >> > > tested a
> >> > > > >> > > > > few things with, and I'm very happy
to sketch out a
> more
> >> > > > >> substantial
> >> > > > >> > > > > implementation if people here think
it's a good idea in
> >> > > general.
> >> > > > >> > > > >
> >> > > > >> > > > > Thoughts? Ideas? Suggestions? Please
tell me if this is
> >> > crazy.
> >> > > > >> > > > >
> >> > > > >> > > > >
> >> > > > >> > > >
> >> > > > >> > >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message