Date: Tue, 15 Dec 2015 10:34:11 +0100
Subject: Re: Alternative strategy for incorporating Java 8 lambdas into Crunch
From: Gabriel Reid
To: dev

Yeah, looking at the two next to each other I'm going for the
collections approach as well. +1

On Tue, Dec 15, 2015 at 2:04 AM, Josh Wills wrote:

> On Mon, Dec 14, 2015 at 4:15 PM, David Whiting wrote:
>
>> 1) Not at all, just some leftover working names for stuff.
>>
>> 2) Not for a totally minimal implementation, but some of the features I
>> would like to include would rely on Java 8 things, for example adapting the
>> GroupedTable stuff to use Streams rather than Iterables because of a) the
>> extra expressivity and b) the implied once-only traversal. We could have a
>> filterMap which applies a Function> (my most common use case
>> for a DoFn instead of a MapFn at the moment). We can also potentially
>> utilise Collectors for collapsing values in reduce-side stuff and finally,
>> it'll make the implementation of it a fair bit easier.
>> The maven overhead is pretty low, so I guess it's just the existence of an
>> extra artifact to consider. The way I see it is that it's a push to make the
>> API feel more like Java streams and be more immediately usable by someone who
>> knows Java streams but not necessarily big data, so the more we can replicate
>> that feel by integrating with other familiar Java 8 features, the better.
>>
>
> Makes sense to me. +1 for a new crunch-lambda module.
>
>>
>> On 15 December 2015 at 00:51, Josh Wills wrote:
>>
>> > I think I lean towards the collections approach, but that's probably
>> > because of my Scrunch experience. Two questions:
>> >
>> > 1) Is mapToTable necessary? I would think map(SFunction, PTableType) would
>> > be distinguishable from map(SFunction, PType) by the compiler in the same
>> > way it is for parallelDo.
>> > 2) Does the collections approach need a separate maven target at all, or
>> > could it just be part of crunch-core as a replacement for the IFn stuff? Or
>> > is there Java 8-only stuff we'll want to add in to its API?
>> >
>> > On Mon, Dec 14, 2015 at 3:13 PM, David Whiting wrote:
>> >
>> > > Ok, so I've implemented a few iterations of this. I went forward with the
>> > > "wrap the functions" method, which seemed to work alright, but finding good
>> > > names for functions which essentially just wrap functions but which aren't
>> > > ambiguous in erasure and read nicely was a real challenge. I showed some
>> > > sample code to some of my fellow data engineers and the consensus seemed to
>> > > be that it was definitely better than anonymous inner classes, but it still
>> > > felt kind of awkward and strange to use.
>> > >
>> > > So here's a 3rd option: wrap the collection types rather than the function
>> > > types, and present an API which feels truly Java 8 native whilst still
>> > > being able to dig back to the underlying PCollections (doing pretty much
>> > > what Scrunch does, but with less implicit Scala magic).
>> > >
>> > > Here's a super-minimal proof-of-concept for that:
>> > > https://gist.github.com/DavW/7efe484ea0c00cf6e66b
>> > >
>> > > and a comparison of the two approaches in usage:
>> > > https://gist.github.com/DavW/997a92b31d55c5317fb7
>> > >
>> > >
>> > > On 13 December 2015 at 16:14, Gabriel Reid wrote:
>> > >
>> > > > This looks very cool. As long as we can keep things compatible with
>> > > > Java 7 using whatever kind of maven voodoo that's necessary, I'm all
>> > > > for it.
>> > > >
>> > > > I'd say no real reason to keep the IFn stuff if this goes in.
>> > > >
>> > > > - Gabriel
>> > > >
>> > > > On Fri, Dec 11, 2015 at 11:18 PM, Josh Wills wrote:
>> > > >
>> > > > > It seems like a net positive over the IFn stuff, so I could make an
>> > > > > argument for replacing it, but if there's anyone out there in love w/IFns,
>> > > > > they should speak up now. :)
>> > > > >
>> > > > > J
>> > > > >
>> > > > > On Fri, Dec 11, 2015 at 2:17 PM, David Whiting wrote:
>> > > > >
>> > > > >> I *think* you can set language level and target jdk on a per-module basis,
>> > > > >> so it should be relatively easy. I'll experiment at some point over the
>> > > > >> weekend. Would this complement or replace the I*Fn stuff do you think? 14.0
>> > > > >> is not yet released, so I guess it's not too late to change if we want to.
>> > > > >>
>> > > > >> On 11 December 2015 at 22:57, Josh Wills wrote:
>> > > > >>
>> > > > >> > That's the sexiest thing I've seen in some time. +1 for a lambda module,
>> > > > >> > but how does that work in Maven-fu?
>> > > > >> > Is it like a conditional compile or something?
>> > > > >> >
>> > > > >> > On Fri, Dec 11, 2015 at 1:20 PM, David Whiting wrote:
>> > > > >> >
>> > > > >> > > Oops, my bad. Here's a Gist:
>> > > > >> > > https://gist.github.com/DavW/e2588e42c45ad8c06038
>> > > > >> > >
>> > > > >> > > On 11 December 2015 at 18:43, Josh Wills <josh.wills@gmail.com> wrote:
>> > > > >> > >
>> > > > >> > > > I think it's kind of awesome, but the attachment didn't go through- PR or
>> > > > >> > > > gist?
>> > > > >> > > > On Fri, Dec 11, 2015 at 7:42 AM David Whiting <davw@apache.org> wrote:
>> > > > >> > > >
>> > > > >> > > > > While fixing the bug where the IFn version of mapValues on PGroupedTable
>> > > > >> > > > > was missing, I got thinking that this is quite an inefficient way of
>> > > > >> > > > > including support for lambdas and method references, and it still didn't
>> > > > >> > > > > actually support quite a few of the features that would make it easy to
>> > > > >> > > > > code against.
>> > > > >> > > > >
>> > > > >> > > > > Negative parts of existing lambda implementation:
>> > > > >> > > > > 1) Explosion of already-crowded PCollection, PTable and PGroupedTable
>> > > > >> > > > > interfaces, and having to implement those methods in all implementations.
>> > > > >> > > > > 2) Not supporting flatMap to Optional or Stream types.
>> > > > >> > > > > 3) Not exposing convenient types for reduce-type operations (Stream
>> > > > >> > > > > instead of Iterable, for example).
>> > > > >> > > > >
>> > > > >> > > > > Something that would solve all three of these is to build lambda support
>> > > > >> > > > > as a separate artifact (so we can use all java8 types), and instead of the
>> > > > >> > > > > API being directly on the PSomething interfaces, we just have convenient
>> > > > >> > > > > ways to wrap up lambdas into DoFns or MapFns via statically-imported
>> > > > >> > > > > methods.
>> > > > >> > > > >
>> > > > >> > > > > The usage then becomes
>> > > > >> > > > > import static org.apache.crunch.Lambda.*;
>> > > > >> > > > > ...
>> > > > >> > > > > someCollection.parallelDo(flatMap(d -> someFnOf(d)), pt)
>> > > > >> > > > > ...
>> > > > >> > > > > otherGroupedTable.mapValue(reduce(seq -> seq.mapToInt(i -> i).sum()), ints())
>> > > > >> > > > >
>> > > > >> > > > > Where flatMap and reduce are static methods on Lambda, and Lambda goes in
>> > > > >> > > > > its own artifact (to preserve compatibility with 6 and 7 for the rest of
>> > > > >> > > > > Crunch).
>> > > > >> > > > > I've attached a basic proof-of-concept implementation which I've tested a
>> > > > >> > > > > few things with, and I'm very happy to sketch out a more substantial
>> > > > >> > > > > implementation if people here think it's a good idea in general.
>> > > > >> > > > >
>> > > > >> > > > > Thoughts? Ideas? Suggestions? Please tell me if this is crazy.
>> > > > >> > > > >
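
For readers following the thread, here is a minimal sketch of the "wrap the functions" idea from the first message: static helpers that turn Java 8 lambdas into function objects. It is an illustration only, not the code from the linked gists. The Emitter and DoFn interfaces below are simplified, hypothetical stand-ins for the Crunch classes of the same name, a plain java.util.function.Function stands in for a MapFn, and LambdaSketch and applyAll are invented names so that the example runs without Crunch on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LambdaSketch {

    /** Hypothetical stand-in for Crunch's Emitter: receives output records. */
    public interface Emitter<T> {
        void emit(T value);
    }

    /** Hypothetical stand-in for Crunch's DoFn: zero or more outputs per input. */
    public interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** Wraps a lambda returning a Stream into a DoFn that emits each element. */
    public static <S, T> DoFn<S, T> flatMap(Function<S, Stream<T>> fn) {
        return (input, emitter) -> fn.apply(input).forEach(emitter::emit);
    }

    /** Adapts a lambda over a Stream of grouped values to one over an Iterable. */
    public static <V, R> Function<Iterable<V>, R> reduce(Function<Stream<V>, R> fn) {
        return values -> fn.apply(StreamSupport.stream(values.spliterator(), false));
    }

    /** Toy driver standing in for parallelDo: runs a DoFn over an in-memory list. */
    public static <S, T> List<T> applyAll(List<S> inputs, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        // One input line fans out to several output words.
        List<String> words = applyAll(
                Arrays.asList("a b", "c"),
                flatMap((String line) -> Arrays.stream(line.split(" "))));
        System.out.println(words); // prints [a, b, c]

        // Grouped values arrive as an Iterable but are consumed as a Stream.
        Function<Iterable<Integer>, Integer> sum =
                reduce((Stream<Integer> seq) -> seq.mapToInt(i -> i).sum());
        System.out.println(sum.apply(Arrays.asList(1, 2, 3))); // prints 6
    }
}
```

The point of the design is that the collection interfaces gain no new methods: the helpers live in one class that only the Java 8 artifact needs to compile, and callers pull them in with a static import.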