crunch-dev mailing list archives

From David Whiting <davidwhit...@gmail.com>
Subject Re: Alternative strategy for incorporating Java 8 lambdas into Crunch
Date Tue, 15 Dec 2015 00:15:21 GMT
1) Not at all, just some leftover working names for stuff.

2) Not for a totally minimal implementation, but some of the features I
would like to include would rely on Java 8 things, for example adapting the
GroupedTable stuff to use Streams rather than Iterables because of a) the
extra expressivity and b) the implied once-only traversal. We could have a
filterMap which applies a Function<S, Optional<T>> (my most common use case
for a DoFn instead of a MapFn at the moment). We could also potentially
use Collectors for collapsing values on the reduce side. Finally, relying
on Java 8 would make the implementation a fair bit easier. The Maven
overhead is pretty low, so I guess it's just the existence of an extra
artifact to consider. The way I see it, this is a push to make the API feel more
like Java streams and be more immediately usable by someone who knows Java
streams but not necessarily big data, so the more we can replicate that
feel by integrating with other familiar Java 8 features, the better.
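
For illustration only, here is a minimal plain-Java sketch of the filterMap
idea described above, written against java.util.stream rather than Crunch.
The class name FilterMapSketch and the helper parseInt are hypothetical and
are not part of any Crunch API; a real version would operate on the wrapped
collection types rather than on Stream directly.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Crunch.
class FilterMapSketch {

    // Applies f to every element and keeps only the results that are present.
    // This combines a map and a filter in one step, which is the use case
    // that otherwise needs a DoFn rather than a MapFn.
    static <S, T> Stream<T> filterMap(Stream<S> input, Function<S, Optional<T>> f) {
        return input.map(f).filter(Optional::isPresent).map(Optional::get);
    }

    // Example function: parse a string as an int, or yield an empty Optional.
    static Optional<Integer> parseInt(String s) {
        try {
            return Optional.of(Integer.parseInt(s.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Unparseable input ("x") is dropped instead of failing the pipeline.
        List<Integer> parsed = filterMap(Stream.of("1", "x", "3"), FilterMapSketch::parseInt)
                .collect(Collectors.toList());
        System.out.println(parsed);

        // A Collector collapses the values, as a reduce-side step might.
        int sum = parsed.stream().collect(Collectors.summingInt(Integer::intValue));
        System.out.println(sum);
    }
}
```

A Stream can only be traversed once, so passing one to a reduce-side function
makes the once-only traversal explicit in a way that an Iterable does not.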

On 15 December 2015 at 00:51, Josh Wills <josh.wills@gmail.com> wrote:

> I think I lean towards the collections approach, but that's probably
> because of my Scrunch experience. Two questions:
>
> 1) Is mapToTable necessary? I would think map(SFunction, PTableType) would
> be distinguishable from map(SFunction, PType) by the compiler in the same
> way it is for parallelDo.
> 2) Does the collections approach need a separate maven target at all, or
> could it just be part of crunch-core as a replacement for the IFn stuff? Or
> is there Java 8-only stuff we'll want to add in to its API?
>
> On Mon, Dec 14, 2015 at 3:13 PM, David Whiting <davw@apache.org> wrote:
>
> > Ok, so I've implemented a few iterations of this. I went forward with the
> > "wrap the functions" method, which seemed to work alright, but finding
> > good names for functions which essentially just wrap functions but which
> > aren't ambiguous in erasure and read nicely was a real challenge. I showed
> > some sample code to some of my fellow data engineers and the consensus
> > seemed to be that it was definitely better than anonymous inner classes,
> > but it still felt kind of awkward and strange to use.
> >
> > So here's a 3rd option: wrap the collection types rather than the
> > function types, and present an API which feels truly Java 8 native whilst
> > still being able to dig back to the underlying PCollections (doing pretty
> > much what Scrunch does, but with less implicit Scala magic).
> >
> > Here's a super-minimal proof-of-concept for that:
> > https://gist.github.com/DavW/7efe484ea0c00cf6e66b
> >
> > and a comparison of the two approaches in usage:
> > https://gist.github.com/DavW/997a92b31d55c5317fb7
> >
> >
> > On 13 December 2015 at 16:14, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> >
> > > This looks very cool. As long as we can keep things compatible with
> > > Java 7 using whatever kind of maven voodoo that's necessary, I'm all
> > > for it.
> > >
> > > I'd say no real reason to keep the IFn stuff if this goes in.
> > >
> > > - Gabriel
> > >
> > > On Fri, Dec 11, 2015 at 11:18 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > > > It seems like a net positive over the IFn stuff, so I could make an
> > > > argument for replacing it, but if there's anyone out there in love
> > > > w/IFns, they should speak up now. :)
> > > >
> > > > J
> > > >
> > > > On Fri, Dec 11, 2015 at 2:17 PM, David Whiting <davw@apache.org> wrote:
> > > >
> > > >> I *think* you can set language level and target jdk on a per-module
> > > >> basis, so it should be relatively easy. I'll experiment at some point
> > > >> over the weekend. Would this complement or replace the I*Fn stuff do
> > > >> you think? 14.0 is not yet released, so I guess it's not too late to
> > > >> change if we want to.
> > > >>
> > > >> On 11 December 2015 at 22:57, Josh Wills <josh.wills@gmail.com> wrote:
> > > >>
> > > >> > That's the sexiest thing I've seen in some time. +1 for a lambda
> > > >> > module, but how does that work in Maven-fu? Is it like a conditional
> > > >> > compile or something?
> > > >> >
> > > >> > On Fri, Dec 11, 2015 at 1:20 PM, David Whiting <davw@apache.org> wrote:
> > > >> >
> > > >> > > Oops, my bad. Here's a Gist:
> > > >> > > https://gist.github.com/DavW/e2588e42c45ad8c06038
> > > >> > >
> > > >> > > On 11 December 2015 at 18:43, Josh Wills <josh.wills@gmail.com> wrote:
> > > >> > >
> > > >> > > > I think it's kind of awesome, but the attachment didn't go
> > > >> > > > through- PR or gist?
> > > >> > > > On Fri, Dec 11, 2015 at 7:42 AM David Whiting <davw@apache.org> wrote:
> > > >> > > >
> > > >> > > > > While fixing the bug where the IFn version of mapValues on
> > > >> > > > > PGroupedTable was missing, I got thinking that this is quite an
> > > >> > > > > inefficient way of including support for lambdas and method
> > > >> > > > > references, and it still didn't actually support quite a few of
> > > >> > > > > the features that would make it easy to code against.
> > > >> > > > >
> > > >> > > > > Negative parts of existing lambda implementation:
> > > >> > > > > 1) Explosion of already-crowded PCollection, PTable and
> > > >> > > > > PGroupedTable interfaces, and having to implement those methods
> > > >> > > > > in all implementations.
> > > >> > > > > 2) Not supporting flatMap to Optional or Stream types.
> > > >> > > > > 3) Not exposing convenient types for reduce-type operations
> > > >> > > > > (Stream instead of Iterable, for example).
> > > >> > > > >
> > > >> > > > > Something that would solve all three of these is to build lambda
> > > >> > > > > support as a separate artifact (so we can use all java8 types),
> > > >> > > > > and instead of the API being directly on the PSomething
> > > >> > > > > interfaces, we just have convenient ways to wrap up lambdas into
> > > >> > > > > DoFns or MapFns via statically-imported methods.
> > > >> > > > >
> > > >> > > > > The usage then becomes
> > > >> > > > > import static org.apache.crunch.Lambda.*;
> > > >> > > > > ...
> > > >> > > > > someCollection.parallelDo(flatMap(d -> someFnOf(d)), pt)
> > > >> > > > > ...
> > > >> > > > > otherGroupedTable.mapValues(reduce(seq -> seq.mapToInt(i -> i).sum()), ints())
> > > >> > > > >
> > > >> > > > > Where flatMap and reduce are static methods on Lambda, and Lambda
> > > >> > > > > goes in its own artifact (to preserve compatibility with Java 6
> > > >> > > > > and 7 for the rest of Crunch).
> > > >> > > > > I've attached a basic proof-of-concept implementation which I've
> > > >> > > > > tested a few things with, and I'm very happy to sketch out a more
> > > >> > > > > substantial implementation if people here think it's a good idea
> > > >> > > > > in general.
> > > >> > > > >
> > > >> > > > > Thoughts? Ideas? Suggestions? Please tell me if this is crazy.
> > > >> > > > >
