Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 19F28200D20 for ; Tue, 17 Oct 2017 10:55:44 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 1883B1609EB; Tue, 17 Oct 2017 08:55:44 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 3413C1609DE for ; Tue, 17 Oct 2017 10:55:43 +0200 (CEST) Received: (qmail 91345 invoked by uid 500); 17 Oct 2017 08:55:42 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@flink.apache.org Received: (qmail 91335 invoked by uid 99); 17 Oct 2017 08:55:41 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Oct 2017 08:55:41 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 142781A0407 for ; Tue, 17 Oct 2017 08:55:41 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.879 X-Spam-Level: * X-Spam-Status: No, score=1.879 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id H1qE7h4M4Cjv for ; Tue, 17 Oct 2017 08:55:38 +0000 (UTC) Received: from mail-ua0-f174.google.com (mail-ua0-f174.google.com [209.85.217.174]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 2FEBC5FAC9 for ; Tue, 17 Oct 2017 08:55:38 +0000 (UTC) Received: by mail-ua0-f174.google.com with SMTP id u32so650928uau.0 for ; Tue, 17 Oct 2017 01:55:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=rJyd3AItcjmStuMTZYGdmv5HAm3S8z6xaJ5tdUpkU/w=; b=BpNBSSkAy+t0t1gVUDl74UO1vMNTYcBJ6LJoxeHBO2jw244Fc9H8hRKoWDtTIYx3qz 6BkmoJ2cmtG6o1rH0RiDv+0sikCF2/156MlJBPWrjA0b3/uiqvTtvUKqVzczXUty2jK8 GT71Iy0G2UY8ZSGNtu/Nlvy+fH5h2luXLy93Zozy9yYT4iH/V5pCtQXtvDfrolKQlX/L gfSxA75Ez5dCTOSZ+vml8YY2MWAevBh0vMZ/XFOcsz7XilDLyxalNHZpc6u0flzAc6qN fS46s/3grXkRGwXXAuVQEPH1Dk44WneV4isYNNKsw72HoH02p999VyPwPqjdhiz72Fwy 1TeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=rJyd3AItcjmStuMTZYGdmv5HAm3S8z6xaJ5tdUpkU/w=; b=UuTM6abmM0ZPPzOCEzYYcJ65KprSynotJcOPHjjXdQcAyzUK3CS48dZjmhPcxyAdMT byhjtIZqTIUrkIYFxIkaptxV9U17jd997nYVVUacKMxu6sW6f9MYZsbkBRNx0NWqOt8/ uqtFNREWNRwmHVSIaJj254Br6A1HoDQ85CLBpE5eY89ijjqtVQqRBXauBgM1YcKzbAkK T22+muKN5Zrhbn0cL7co5mYr1Mh6BsDxfMayySiqh7SrzKEwLgcoYXQA2W6twt8M0lK0 CVO2w4tenhfs7ZTOfagLIEUkMbg17BfuPd3ZEbP+7VhrA4LAt4I1g0FjrRr7it+kxFal W6gA== X-Gm-Message-State: AMCzsaVNhFJruNjUKvh6RQZURNzzloyiaDtZDR0yp4pm5r5xrPan8Y5R +sCEUNZ5MrXofcRh6tqTLt55QEm4kuRHBS23KeM= X-Google-Smtp-Source: ABhQp+R0GtsxHn6HWSI4NwwLPwBCd0Z8DVAj6FX/yBxILulBU4goyZUwgXeSi4GLdU5RMzn+cGg70lBnUiMpdqH5N9g= X-Received: by 10.176.8.201 with SMTP id o9mr7509722uaf.155.1508230530757; Tue, 17 Oct 2017 01:55:30 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.12.69 with HTTP; Tue, 17 Oct 2017 01:54:49 -0700 (PDT) In-Reply-To: <0E0AC9FC-9100-45CD-BC14-498E4AE4C1EB@gmail.com> References: <0E0AC9FC-9100-45CD-BC14-498E4AE4C1EB@gmail.com> From: Fabian Hueske Date: Tue, 17 Oct 2017 10:54:49 +0200 Message-ID: Subject: Re: Split a dataset To: Magnus Vojbacke Cc: user Content-Type: multipart/alternative; boundary="f403045e630c0677fe055bba4935" archived-at: Tue, 17 Oct 2017 08:55:44 -0000 --f403045e630c0677fe055bba4935 Content-Type: text/plain; charset="UTF-8" Hi Magnus, there is no Split operator on the DataSet API. As you said, this can be done using a FilterFunction. This also allows for non-binary splits: DataSet setToSplit = ... DataSet firstSplit = setToSplit.filter(new SplitCondition1()); DataSet secondSplit = setToSplit.filter(new SplitCondition2()); DataSet thirdSplit = setToSplit.filter(new SplitCondition3()); where SplitCondition1, SplitCondition2, and SplitCondition3 are FilterFunction that filter out all records that don't belong to the split. Best, Fabian 2017-10-17 10:42 GMT+02:00 Magnus Vojbacke : > I'm looking for something like DataStream.split(), but for DataSets. I'd > like to split my streaming data so messages go to different parts of an > execution graph, based on arbitrary logic. > > DataStream.split() seems to be perfect, except that my source is a CSV > file, and I have only found built in functions for reading CSV files into a > DataSet. > > I've evaluated using DataSet.filter(), but as far as I can tell, that only > allows me to emulate a yes/no split. This is not ideal because it's too > coarse, and I would prefer a more fine grained split than that. > > > Do you have any suggestions on how I can achieve my arbitrary splitting > logic for a) DataSets in general, or b) CSV files? > > --f403045e630c0677fe055bba4935 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Magnus,

ther= e is no Split operator on the DataSet API.

As you said, this c= an be done using a FilterFunction. This also allows for non-binary splits:<= br>
DataSet<X> setToSplit =3D ...
DataSet<X> = firstSplit =3D setToSplit.filter(new SplitCondition1());
DataSet<X>= ; secondSplit =3D setToSplit.filter(new SplitCondition2());
DataSet<X= > thirdSplit =3D setToSplit.filter(new SplitCondition3());

= where SplitCondition1, SplitCondition2, and SplitCondition3 are FilterFunct= ion that filter out all records that don't belong to the split.

=
Best, Fabian

2017-10-17 10:42 GMT+02:00 Magnus Vojbacke <= ;magnus.vojb= acke@gmail.com>:
I'm = looking for something like DataStream.split(), but for DataSets. I'd li= ke to split my streaming data so messages go to different parts of an execu= tion graph, based on arbitrary logic.

DataStream.split() seems to be perfect, except that my source is a CSV file= , and I have only found built in functions for reading CSV files into a Dat= aSet.

I've evaluated using DataSet.filter(), but as far as I can tell, that o= nly allows me to emulate a yes/no split. This is not ideal because it's= too coarse, and I would prefer a more fine grained split than that.


Do you have any suggestions on how I can achieve my arbitrary splitting log= ic for a) DataSets in general, or b) CSV files?


--f403045e630c0677fe055bba4935--