From: Ewen Cheslack-Postava
Date: Sat, 10 Dec 2016 23:52:38 -0800
Subject: Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages
To: dev@kafka.apache.org

If anyone has time to review here, it'd be great to get feedback. I'd
imagine that the proposal itself won't be too controversial -- it keeps
transformations simple (by only allowing map/filter), doesn't affect the
rest of the framework much, and fits in with the general config structure
we've used elsewhere (although ConfigDef could use some updates to make
this easier...).

I think the main open questions for me are:

a) Is TransformableRecord worth it to avoid reimplementing small bits of
code (it allows a single implementation of the interface to trivially
apply to both SourceRecords and SinkRecords)? I think I prefer this, but
it does come with some commitment to another interface on top of
ConnectRecord. We could alternatively modify ConnectRecord, which would
require fewer changes.

b) How do folks feel about built-in transformations and the set that are
mentioned here? This brings us way back to the discussion of built-in
connectors. Transformations, especially when intended to be lightweight
and to touch nothing besides the data already in the record, seem
different from connectors -- there might be quite a few, but hopefully a
limited number. Since we (hopefully) already factor out most
serialization-specific stuff via Converters, I think we can keep this
pretty limited.
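[Editor's note: for concreteness, question (a) could look something like the
rough Java sketch below. Apart from the TransformableRecord name from the
proposal, every identifier here (FakeSinkRecord, MaskValue, withValue) is
hypothetical and only stands in for the real Connect types; the point is the
self-referential generic bound, which lets a single Transformation
implementation apply to either record type while each record class returns a
copy of its own narrower type.]

```java
import java.util.Map;

// Hypothetical stand-in for the proposal's TransformableRecord: the
// self-referential bound R extends TransformableRecord<R> means withValue()
// returns the implementing class's own narrower type, so transformers never
// need to know whether they hold a source or a sink record.
interface TransformableRecord<R extends TransformableRecord<R>> {
    Object value();
    R withValue(Object newValue);  // return a copy; records stay immutable
}

// One transformation interface covering both record types (map/filter only;
// a null return could signal "drop this record").
interface Transformation<R extends TransformableRecord<R>> {
    void configure(Map<String, String> props);
    R apply(R record);
}

// Illustrative concrete record, standing in for a real SinkRecord.
final class FakeSinkRecord implements TransformableRecord<FakeSinkRecord> {
    private final Object value;
    FakeSinkRecord(Object value) { this.value = value; }
    @Override public Object value() { return value; }
    @Override public FakeSinkRecord withValue(Object v) {
        return new FakeSinkRecord(v);  // framework allocates the copy type
    }
}

// A trivial transformation: replace the value with a configured constant.
final class MaskValue<R extends TransformableRecord<R>> implements Transformation<R> {
    private Object replacement;
    @Override public void configure(Map<String, String> props) {
        replacement = props.getOrDefault("replacement", "****");
    }
    @Override public R apply(R record) { return record.withValue(replacement); }
}

public class TransformSketch {
    public static void main(String[] args) {
        MaskValue<FakeSinkRecord> mask = new MaskValue<>();
        mask.configure(Map.of("replacement", "XXX"));
        System.out.println(mask.apply(new FakeSinkRecord("secret")).value());
        // prints XXX
    }
}
```

The same MaskValue instance type-checks against any TransformableRecord
implementation, which is the reuse the proposal is after.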
That said, I have no doubt some folks will (in my opinion) abuse this
feature to do data enrichment by querying external systems, so building a
bunch of transformations in could potentially open the floodgates, or at
least make decisions about what is included vs. what should be third-party
muddy.

-Ewen

On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan wrote:

> Hi all,
>
> I have another iteration at a proposal for this feature here:
> https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
>
> I'd welcome your feedback and comments.
>
> Thanks,
>
> Shikhar
>
> On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava wrote:
>
> > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan wrote:
> >
> > > > Hmm, operating on ConnectRecords probably doesn't work since you
> > > > need to emit the right type of record, which might mean
> > > > instantiating a new one. I think that means we either need 2
> > > > methods, one for SourceRecord, one for SinkRecord, or we'd need
> > > > to limit what parts of the message you can modify (e.g. you can
> > > > change the key/value via something like
> > > > transformKey(ConnectRecord) and transformValue(ConnectRecord),
> > > > but other fields would remain the same and the framework would
> > > > handle allocating new Source/SinkRecords if needed).
> > >
> > > Good point, perhaps we could add an abstract method on
> > > ConnectRecord that takes all the shared fields as parameters and
> > > the implementations return a copy of the narrower
> > > SourceRecord/SinkRecord type as appropriate. Transformers would
> > > only operate on ConnectRecord rather than caring about SourceRecord
> > > or SinkRecord (in theory they could instanceof/cast, but the API
> > > should discourage it).
> > >
> > > > Is there a use case for hanging on to the original? I can't think
> > > > of a transformation where you'd need to do that (or couldn't just
> > > > order things differently so it isn't a problem).
> > >
> > > Yeah, maybe this isn't really necessary. No strong preference here.
> > >
> > > > That said, I do worry a bit that farming too much stuff out to
> > > > transformers can result in "programming via config", i.e. a lot
> > > > of the simplicity you get from Connect disappears in long config
> > > > files. Standardization would be nice and might just avoid this
> > > > (and doesn't cost that much to implement in each connector), and
> > > > I'd personally prefer something a bit less flexible but
> > > > consistent and easy to configure.
> > >
> > > Not sure what you're suggesting :-) Standardized config properties
> > > for a small set of transformations, leaving it up to connectors to
> > > integrate?
> >
> > I just mean that you get to the point where you're practically
> > writing a Kafka Streams application, you're just doing it through
> > either an incredibly convoluted set of transformers and configs, or a
> > single transformer with an incredibly convoluted set of configs. You
> > basically get to the point where your config is a mini DSL and you're
> > not really saving that much.
> >
> > The real question is how much we want to venture into the "T" part of
> > ETL. I tend to favor minimizing how much we take on, since the rest
> > of Connect isn't designed for it; it's designed around the E & L
> > parts.
> >
> > -Ewen
> >
> > > > Personally I'm skeptical of that level of flexibility in
> > > > transformers -- it's getting awfully complex and certainly takes
> > > > us pretty far from "config only" realtime data integration. It's
> > > > not clear to me what the use cases are that aren't covered by a
> > > > small set of common transformations that can be chained together
> > > > (e.g. rename/remove fields, mask values, and maybe a couple
> > > > more).
> > >
> > > I agree that we should have some standard transformations that we
> > > ship with Connect that users would ideally lean towards for routine
> > > tasks.
> > > The ones you mention are some good candidates, where I'd imagine we
> > > can expose simple config, e.g.:
> > >
> > >   transform.filter.whitelist=x,y,z   # filter to a whitelist of fields
> > >   transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > >   topic.rename.replace=-/_
> > >   topic.rename.prefix=kafka_
> > >
> > > etc.
> > >
> > > However, the ecosystem will invariably have more complex
> > > transformers if we make this pluggable. And because ETL is messy,
> > > that's probably a good thing if folks are able to do their data
> > > munging orthogonally to connectors, so that connectors can focus on
> > > the logic of how data should be copied from/to datastores and
> > > Kafka.
> > >
> > > > In any case, we'd probably also have to change configs of
> > > > connectors if we allowed configs like that, since presumably
> > > > transformer configs will just be part of the connector config.
> > >
> > > Yeah, haven't thought much about how all the configuration would
> > > tie together...
> > >
> > > I think we'd need the ability to:
> > > - spec the transformer chain (fully-qualified class names? perhaps
> > >   special aliases for built-in ones? perhaps third-party FQCNs can
> > >   be assigned aliases by users in the chain spec, for easier
> > >   configuration and to uniquely identify a transformation when it
> > >   occurs more than once in a chain?)
> > > - configure each transformer -- all properties prefixed with that
> > >   transformer's ID (FQCN / alias) get routed to it
> > >
> > > Additionally, I think we would probably want to allow for
> > > topic-specific overrides (e.g. you want certain transformations for
> > > one topic, but different ones for another...)
> >
> > --
> > Thanks,
> > Ewen
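[Editor's note: Shikhar's two bullets about a chain spec and per-transformer
config could be sketched as a connector properties fragment like the one
below. All property names, aliases, and class names here are illustrative
assumptions, not anything the thread or the KIP had settled on.]

```properties
# Hypothetical chain spec: an ordered list of transformer aliases.
# Built-ins could get short aliases; third-party transformers are bound
# to a fully-qualified class name below.
transforms=filterFields,renameFields,addPrefix

# Per-transformer config: every property prefixed with an alias is routed
# to that transformer instance, so the same class can appear in the chain
# more than once under different aliases.
transforms.filterFields.type=org.apache.kafka.connect.transforms.Filter
transforms.filterFields.whitelist=x,y,z

transforms.renameFields.type=org.apache.kafka.connect.transforms.Rename
transforms.renameFields.spec=oldName1=>newName1,oldName2=>newName2

# Third-party transformer identified by its FQCN, aliased for readability.
transforms.addPrefix.type=com.example.transforms.TopicPrefix
transforms.addPrefix.prefix=kafka_
```

The alias-prefix scheme keeps each transformer's namespace separate while
leaving the chain order explicit in a single property.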