From dev-return-9532-archive-asf-public=cust-asf.ponee.io@beam.apache.org Fri May 4 18:49:05 2018
From: Chamikara Jayalath
Date: Fri, 04 May 2018 16:48:45 +0000
Subject: Re: I want to allow a user-specified QuerySplitter for DatastoreIO
To: dev@beam.apache.org
Hi Frank,

On Thu, May 3, 2018 at 1:07 PM Lukasz Cwik wrote:

> I also like the idea of doing the splitting when the pipeline is running
> and not during pipeline construction. This works a lot better with things
> like templates.
>
> Do you know which Maven package contains the com.google.rpc classes, and
> what the transitive dependency tree of that package is?
>
> If those dependencies are already exposed (or not complex), then adding
> com.google.rpc to the API surface whitelist will be a non-issue.
>
> On Thu, May 3, 2018 at 8:28 AM Frank Yellin wrote:
>
>> I actually tried (1), and ran precisely into the size limit that you
>> mentioned. Because of the size of the database, I needed to split it
>> into a few hundred shards, and that was more than the request limit.

Have you tried adding a Reshuffle transform after reading from Datastore?
Even if you have a smaller number of initial shards, a reshuffle can help
significantly by allowing the steps that follow the read to be parallelized
further.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L65
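A minimal sketch of that setup, assuming the standard DatastoreIO v1 read
API; the project ID and the "ledger" kind below are placeholders:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.KindExpression;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReshuffleAfterDatastoreRead {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Placeholder query over a placeholder "ledger" kind.
    Query query =
        Query.newBuilder()
            .addKind(KindExpression.newBuilder().setName("ledger"))
            .build();

    PCollection<Entity> entities =
        pipeline.apply(
            "ReadFromDatastore",
            DatastoreIO.v1().read().withProjectId("my-project").withQuery(query));

    // Reshuffle breaks fusion between the read and the steps that follow it,
    // so downstream work is parallelized beyond the number of initial splits.
    PCollection<Entity> reshuffled =
        entities.apply("BreakFusion", Reshuffle.viaRandomKey());

    // ...downstream ParDos consume 'reshuffled'...

    pipeline.run();
  }
}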
>> I was also considering a slightly different alternative to (2), such as
>> adding setQueries() or setSplitterPTransform(). The semantics would be
>> identical to those of your ReadAll, but I'd be able to reuse more of the
>> code that is already there. This gave me interesting results, but it
>> wasn't as powerful as what I needed. See (2) below.

Could you explain how these would be semantically equivalent to ReadAll?
With the ReadAll transform, the flow would be something like the following:

    pipeline.apply(ParDo(MyDoFnThatSplitsQueries())).apply(DatastoreIO.ReadAll())

'MyDoFnThatSplitsQueries' would be your custom DoFn that performs the
splitting (into as many splits as you want).
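Spelled out as a sketch — note that DatastoreIO's ReadAll is hypothetical
here; it did not exist at the time of this thread, which is exactly what
option (2) below proposes to add:

import com.google.datastore.v1.Query;
import java.util.Collections;
import org.apache.beam.sdk.transforms.DoFn;

// A user-provided DoFn that turns one seed query into many sub-queries; the
// splitting strategy (here a time-range stub) is entirely up to the user.
class MyDoFnThatSplitsQueries extends DoFn<Query, Query> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (Query subQuery : splitByTimeRange(c.element())) {
      c.output(subQuery);
    }
  }

  private Iterable<Query> splitByTimeRange(Query query) {
    // Placeholder for user-defined splitting, e.g. one sub-query per
    // creationTime slice.
    return Collections.singletonList(query);
  }
}

// Wiring, given a hypothetical ReadAll transform on DatastoreIO:
//
//   pipeline
//       .apply(Create.of(seedQuery))
//       .apply(ParDo.of(new MyDoFnThatSplitsQueries()))
//       .apply(DatastoreIO.v1().readAll());   // hypothetical; see (2) below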
>> The two specific use cases motivating me were that I needed to write
>> code that could
>>       (1) delete a property from all Entitys whose creationTime is
>>           between one month and two months ago, and
>>       (2) delete all Entitys whose creationTime is more than two years
>>           ago.
>> I think these are common-enough operations. For a very large database,
>> it would be nice to be able to read only the small piece of it that is
>> needed for your operation.

Have you considered adding a filter ParDo that follows the read? I
understand that this would increase the amount of data that you read, but
I still prefer not allowing users to customize splitting, due to the
serious issues I mentioned previously. Regarding deletion, I don't think
the source is the right place for that. We provide a separate transform
for deletion. Can you try to use that?

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L1009
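A minimal sketch of that read-filter-delete combination for use case (2),
assuming creationTime is stored as a timestamp property; the property name,
project ID, and cutoff are placeholders:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Value;
import org.apache.beam.sdk.transforms.DoFn;

// Keeps only entities whose creationTime timestamp is older than the given
// cutoff (seconds since the epoch); everything else is dropped.
class OlderThanCutoffFn extends DoFn<Entity, Entity> {
  private final long cutoffSeconds;

  OlderThanCutoffFn(long cutoffSeconds) {
    this.cutoffSeconds = cutoffSeconds;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Value creationTime = c.element().getPropertiesMap().get("creationTime");
    if (creationTime != null
        && creationTime.getTimestampValue().getSeconds() < cutoffSeconds) {
      c.output(c.element());
    }
  }
}

// Wiring: read broadly, filter in the data plane, then hand the survivors
// to the deletion transform linked above:
//
//   pipeline
//       .apply(DatastoreIO.v1().read()
//           .withProjectId("my-project").withQuery(query))
//       .apply(ParDo.of(new OlderThanCutoffFn(twoYearCutoffSeconds)))
//       .apply(DatastoreIO.v1().deleteEntity().withProjectId("my-project"));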
>> The first is easy to handle: I know the start and end of creationTime,
>> and I can shard it myself. The second requires me to consult the
>> datastore to find out what the smallest creationTime in the datastore
>> is, and then use that as an advisory (not hard) lower limit; the query
>> splitter should work well whether the oldest records are four years old
>> or barely more than two years old. For this to be possible, I need
>> access to the Datastore object, and this Datastore object needs to be
>> passed to some sort of user callback. The QuerySplitter hook already
>> existed and seemed to fit my needs perfectly.
>>
>> Is there a better alternative that still gives me access to the
>> Datastore?
>>
>> On Thu, May 3, 2018 at 2:52 AM, Chamikara Jayalath wrote:
>>
>>> Thanks. IMHO it might be better to perform this splitting as a part of
>>> your pipeline instead of making source splitting customizable. The
>>> reason is that it's easy for users to shoot themselves in the foot if
>>> we allow them to specify a custom splitter. A bug in a custom
>>> QuerySplitter can result in a hard-to-catch data loss or data
>>> duplication bug. So I'd rather not make it a part of the user API.
>>>
>>> I can think of two ways of performing this splitting as a part of your
>>> pipeline.
>>> (1) Split the query during job construction and create a source per
>>>     query. This can be followed by a Flatten transform that creates a
>>>     single PCollection. (One caveat is that you might run into the
>>>     10MB request size limit if you create too many splits here, so try
>>>     reducing the number of splits if you run into this.)
>>> (2) Add a ReadAll transform to DatastoreIO. This would allow you to
>>>     precede the step that performs the reading with a ParDo step that
>>>     splits your query and creates a PCollection of queries. You should
>>>     not run into size limits here, since the splitting happens in the
>>>     data plane.
>>>
>>> Thanks,
>>> Cham
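For comparison with the ReadAll sketch earlier in the thread, a minimal
sketch of option (1), with the construction-time splitting left as a
placeholder and "my-project" assumed:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Query;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class SplitAtConstruction {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in for whatever splitting the user performs at job-construction
    // time; each resulting query becomes its own read below.
    List<Query> splits = splitQuery();

    // One read per split, flattened into a single PCollection. Creating very
    // many reads is what can push the job request over the 10MB size limit.
    List<PCollection<Entity>> reads = new ArrayList<>();
    for (int i = 0; i < splits.size(); i++) {
      reads.add(
          pipeline.apply(
              "Read" + i,
              DatastoreIO.v1()
                  .read()
                  .withProjectId("my-project")
                  .withQuery(splits.get(i))));
    }
    PCollection<Entity> all =
        PCollectionList.of(reads).apply(Flatten.pCollections());

    pipeline.run();
  }

  private static List<Query> splitQuery() {
    return Collections.emptyList(); // placeholder for user-defined splitting
  }
}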
>>> On Wed, May 2, 2018 at 12:50 PM Frank Yellin wrote:
>>>
>>>> TL;DR:
>>>> Is it okay for me to expose Datastore in Apache Beam's DatastoreIO,
>>>> and thus indirectly expose com.google.rpc.Code?
>>>> Is there a better solution?
>>>>
>>>> As I explain in BEAM-4186, I would like to be able to extend
>>>> DatastoreV1.Read with a
>>>>       withQuerySplitter(QuerySplitter querySplitter)
>>>> method, which would use an alternative query splitter. The standard
>>>> one shards by key and is very limited.
>>>>
>>>> I have already written such a query splitter. In fact, the query
>>>> splitter I've written goes further than what Beam specifies: it reads
>>>> the minimum or maximum value of the field from the datastore if no
>>>> minimum or maximum is specified in the query, and uses that value for
>>>> the sharding. I can write
>>>>       SELECT * FROM ledger WHERE type = 'purchase'
>>>> and then ask it to shard on eventTime, and it will shard nicely! I am
>>>> working with the Datastore folks to separately add my new query
>>>> splitter as an option in DatastoreHelper.
>>>>
>>>> I have already written the code to add withQuerySplitter:
>>>>
>>>>       https://github.com/apache/beam/pull/5246
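For context, the QuerySplitter interface under discussion comes from
com.google.datastore.v1.client; a custom implementation plus the proposed
withQuerySplitter hook (from the PR above, not part of released Beam) would
look roughly like this sketch:

import com.google.datastore.v1.PartitionId;
import com.google.datastore.v1.Query;
import com.google.datastore.v1.client.Datastore;
import com.google.datastore.v1.client.DatastoreException;
import com.google.datastore.v1.client.QuerySplitter;
import java.util.List;

// A custom splitter that shards on a property (e.g. eventTime) rather than
// on keys. The Datastore handle in the signature is the crux of the thread:
// it lets the splitter look up min/max property values when the query does
// not bound them, and it is also what drags DatastoreException (and hence
// com.google.rpc.Code) into Beam's API surface.
class PropertyRangeQuerySplitter implements QuerySplitter {
  @Override
  public List<Query> getSplits(
      Query query, PartitionId partition, int numSplits, Datastore datastore)
      throws DatastoreException {
    // Placeholder: consult 'datastore' for the property's min/max, then emit
    // numSplits sub-queries covering consecutive ranges of that property.
    throw new UnsupportedOperationException("sketch only");
  }
}

// Proposed wiring from the PR:
//
//   DatastoreIO.v1().read()
//       .withProjectId("my-project")
//       .withQuery(query)
//       .withQuerySplitter(new PropertyRangeQuerySplitter());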
>>>> However, the problem is that I am increasing the "API surface" of
>>>> Dataflow:
>>>>       QuerySplitter exposes Datastore, which exposes
>>>>       DatastoreException, which exposes com.google.rpc.Code,
>>>> and com.google.rpc.Code is not (yet) part of the API surface.
>>>>
>>>> As a solution, I've added the package com.google.rpc to the list of
>>>> classes exposed. This package contains protobuf enums. Is this okay?
>>>> Is there a better solution?
>>>>
>>>> Thanks.