From dev-return-23986-archive-asf-public=cust-asf.ponee.io@beam.apache.org  Thu Jul  2 02:18:33 2020
Return-Path: <dev-return-23986-archive-asf-public=cust-asf.ponee.io@beam.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 29AC7180638
	for <archive-asf-public@cust-asf.ponee.io>; Thu,  2 Jul 2020 04:18:33 +0200 (CEST)
Received: (qmail 89052 invoked by uid 500); 2 Jul 2020 02:18:31 -0000
Mailing-List: contact dev-help@beam.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@beam.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@beam.apache.org>
List-Post: <mailto:dev@beam.apache.org>
List-Id: <dev.beam.apache.org>
Reply-To: dev@beam.apache.org
Delivered-To: mailing list dev@beam.apache.org
Received: (qmail 89034 invoked by uid 99); 2 Jul 2020 02:18:30 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Jul 2020 02:18:30 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 0DC2A181434
	for <dev@beam.apache.org>; Thu,  2 Jul 2020 02:18:30 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 0.202
X-Spam-Level:
X-Spam-Status: No, score=0.202 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=0.2,
	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001,
	RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001]
	autolearn=disabled
Authentication-Results: spamd3-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=frantil-com.20150623.gappssmtp.com
Received: from mx1-ec2-va.apache.org ([10.40.0.8])
	by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024)
	with ESMTP id JidOwR3RbUNX for <dev@beam.apache.org>;
	Thu,  2 Jul 2020 02:18:26 +0000 (UTC)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.161.52; helo=mail-oo1-f52.google.com; envelope-from=robert@frantil.com; receiver=<UNKNOWN> 
Received: from mail-oo1-f52.google.com (mail-oo1-f52.google.com [209.85.161.52])
	by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id C24F8BC0B4
	for <dev@beam.apache.org>; Thu,  2 Jul 2020 02:18:25 +0000 (UTC)
Received: by mail-oo1-f52.google.com with SMTP id d125so1734278oob.0
        for <dev@beam.apache.org>; Wed, 01 Jul 2020 19:18:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=frantil-com.20150623.gappssmtp.com; s=20150623;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=bvvxhFN0ZA4xTFhVzB1PoH76w1tR++DRSP/ObJ6DKHs=;
        b=tXdzBV8FTEdbFHiFymKP4mBlaj85eUkFeCsSYG/7F+md9sXj/SeHzABVppEkBx7Nbg
         YF8aTjjSI8lpEr1N/mbVztQvLN+pqsMUqMPfAdAWpp6uD5CCOxdZA4NCg0prWePbkrVx
         qHKaoGrIQIl4vraidiM6MLONEcYzCbKPxZvZykom/aiu7YDhGqjvTj9244cYveceOj3s
         uLRf7YWObINZvbw0hQsIf0iQe4mdG9ZQH5kO9MjxhpjSNqafZ8yD2uo5jOHwSlDzh1nx
         cN5lE4vmUInnS+alazTluTNSc01Bb62/pA3pSj6qXkAfY2sbuA0u6+zUJLhjDD1I+R8D
         dCyQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=bvvxhFN0ZA4xTFhVzB1PoH76w1tR++DRSP/ObJ6DKHs=;
        b=LWF/vEOqC2sIJZ/dDbmqTCteLJg2vjvUFDIQDWq3O8Z01/CHxNfHTOWzARDGjD6kFc
         jGEvhYPOzxSoeNgCjPqgxAbEpc2A96+1HsONggRsq51Slr2LmGt7IktQzr0Ft2wK1juR
         D8gIBHbnQ289ulhmPQcZlRYDYvBr5lbIYkbmCndmkLfvny8pJRg42jv/0BGTdSlhZdWr
         UhYtxYmFIaDOyBjdDX7vgm0VUxBR1n5e3Po236UqBpq4RAwtQKCZY8N7OjLL+YMoOB0P
         CsQ7sPrIhAxrt4AfQEZ10gM+u9xxgSQKl32jnSbeE3a0EFOQbn1hR2xqN4v0yAY53aKT
         tbcA==
X-Gm-Message-State: AOAM533aqod77q2YqZn/aaOiCxhJuZX98tODatqtuk8ti5vLdOTGUQh1
	Xqo0D3mC8iSltFWl/LrUS7nliC8ZvQ7v57oBiuC9zzY1uVHttw==
X-Google-Smtp-Source: ABdhPJwzxcI7xpktGrJclH1dJK6LDDf3V1lVSGLC/caSabFwuW/ETBqIokXUUrUPcJW2OMzF//9MT2V4RYtZxuOTTqQ=
X-Received: by 2002:a4a:4cc1:: with SMTP id a184mr10585610oob.62.1593656298728;
 Wed, 01 Jul 2020 19:18:18 -0700 (PDT)
MIME-Version: 1.0
References: <CAF9t7_5awS-XaLAqHz4a+jnvYcKSqwRh369CR9mV9vg-7S2MTg@mail.gmail.com>
In-Reply-To: <CAF9t7_5awS-XaLAqHz4a+jnvYcKSqwRh369CR9mV9vg-7S2MTg@mail.gmail.com>
From: Robert Burke <robert@frantil.com>
Date: Wed, 1 Jul 2020 19:18:06 -0700
Message-ID: <CAO2KYsnV9s_GNvgJ8RBFBZui5F7+BLKZRvH6HvKE_X6D7x=MQA@mail.gmail.com>
Subject: Re: XLang sub-graph representation within the SDKs pipeline types
To: dev <dev@beam.apache.org>
Cc: Harsh Vardhan <ananvay@google.com>, Chamikara Jayalath <chamikara@google.com>, 
	Robert Bradshaw <robertwb@google.com>, Kevin Puthusseri <kevinsijo@google.com>, 
	Robert Burke <rebo@google.com>, Brian Hulette <bhulette@google.com>, Heejong Lee <heejong@google.com>
Content-Type: multipart/alternative; boundary="00000000000094828c05a96c0529"

--00000000000094828c05a96c0529
Content-Type: text/plain; charset="UTF-8"

From the Go SDK side, it was built that way nearly from the start.
Historically there was a direct SDK rep -> Dataflow rep conversion, but
that's been replaced with a SDK rep -> Beam Proto -> Dataflow rep
conversion.

In particular, this approach had a few benefits: easier to access local
context for pipeline validation at construction time, to permit as early a
failure as possible, which might be easier with native language constructs
vs beam representations of them.(Eg. DoFns not matching ParDo & Collection
types, and similar)
Protos are convenient, but impose certain structure on how the pipeline
graph is handled. (This isn't to say an earlier conversion isn't possible,
one can do almost anything in code, but it lets the structure be optimised
for this case.)

The big advantage of translating from Beam proto -> to Dataflow Rep is that
the Dataflow Rep can get the various unique IDs that are mandated for the
Beam proto process.

However, the same can't really be said for the other way around.  A good
question is "when should the unique IDs be assigned?"

While I'm not working on adding XLang to the Go SDK directly (that would be
our wonderful intern, Kevin),  I've kind of pictured that the process was
to provide the Expansion service with unique placeholders if unable to
provide the right IDs, and substitute them in returned pipeline graph
segment afterwards, once that is known. That is, we can be relatively
certain that the expansion service will be self consistent, but it's the
SDK requesting the expansion's responsibility to ensure they aren't
colliding with the primary SDKs pipeline ids.

Otherwise, we could probably recommend a translation protocol (if one
doesn't exist already, it probably does) and when XLang expansions are to
happen in the SDK -> beam proto process. So something like Pass 1, intern
all coders and Pcollections, Pass 2 intern all DoFns and environments, Pass
3 expand Xlang, ... Etc.

The other half of this is when happens when Going from Beam proto a
-> SDK? This happens during pipeline execution, but at least in the Go SDK
partly happens when creating the Dataflow rep. In particular, Coder
reference values only have a populated ID when they've been "rehydrated"
from the Beam proto, since the Beam Proto is the first place where such IDs
are correctly assigned.

Tl;dr; i think the right question to sort out is when should IDs be
expected to be assigned and available during pipeline construction.

On Wed, Jul 1, 2020, 6:34 PM Luke Cwik <lcwik@google.com> wrote:

> It seems like we keep running into translation issues with XLang due to
> how it is represented in the SDK. (e.g. Brian's work on context map due to
> loss of coder ids, Heejong's work related to missing environment ids on
> windowing strategies).
>
> I understand that there is an effort that is Dataflow specific where the
> conversion of the Beam proto -> Dataflow API (v1b3) will help with some
> issues but it still requires the SDK pipeline representation -> Beam proto
> to occur correctly which won't be fixed by the Dataflow specific effort.
>
> Why did we go with the current approach?
>
> What other ways could we do this?
>

--00000000000094828c05a96c0529
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"auto"><div>From the Go SDK side, it was built that way nearly f=
rom the start. Historically there was a direct SDK rep -&gt; Dataflow rep c=
onversion, but that&#39;s been replaced with a SDK rep -&gt; Beam Proto -&g=
t; Dataflow rep conversion.<div dir=3D"auto"><br></div><div dir=3D"auto">In=
 particular, this approach had a few benefits: easier to access local conte=
xt for pipeline validation at construction time, to permit as early a failu=
re as possible, which might be easier with native language constructs vs be=
am representations of them.(Eg. DoFns not matching ParDo &amp; Collection t=
ypes, and similar)</div><div dir=3D"auto">Protos are convenient, but impose=
 certain structure on how the pipeline graph is handled. (This isn&#39;t to=
 say an earlier conversion isn&#39;t possible, one can do almost anything i=
n code, but it lets the structure be optimised for this case.)</div><div di=
r=3D"auto"><br></div>The big advantage of translating from Beam proto -&gt;=
 to Dataflow Rep is that the Dataflow Rep can get the various unique IDs th=
at are mandated for the Beam proto process.</div><div dir=3D"auto"><br></di=
v><div dir=3D"auto">However, the same can&#39;t really be said for the othe=
r way around.=C2=A0 A good question is &quot;when should the unique IDs be =
assigned?&quot;</div><div dir=3D"auto"><br></div><div dir=3D"auto">While I&=
#39;m not working on adding XLang to the Go SDK directly (that would be our=
 wonderful intern, Kevin),=C2=A0 I&#39;ve kind of pictured that the process=
 was to provide the Expansion service with unique placeholders if unable to=
 provide the right IDs, and substitute them in returned pipeline graph segm=
ent afterwards, once that is known. That is, we can be relatively certain t=
hat the expansion service will be self consistent, but it&#39;s the SDK req=
uesting the expansion&#39;s responsibility to ensure they aren&#39;t collid=
ing with the primary SDKs pipeline ids.</div><div dir=3D"auto"><br></div><d=
iv dir=3D"auto">Otherwise, we could probably recommend a translation protoc=
ol (if one doesn&#39;t exist already, it probably does) and when XLang expa=
nsions are to happen in the SDK -&gt; beam proto process. So something like=
 Pass 1, intern all coders and Pcollections, Pass 2 intern all DoFns and en=
vironments, Pass 3 expand Xlang, ... Etc.</div><div dir=3D"auto"><br></div>=
<div dir=3D"auto">The other half of this is when happens when Going from Be=
am proto a</div><div dir=3D"auto">-&gt; SDK? This happens during pipeline e=
xecution, but at least in the Go SDK partly happens when creating the Dataf=
low rep. In particular, Coder reference values only have a populated ID whe=
n they&#39;ve been &quot;rehydrated&quot; from the Beam proto, since the Be=
am Proto is the first place where such IDs are correctly assigned.</div><di=
v dir=3D"auto"><br></div><div dir=3D"auto">Tl;dr; i think the right questio=
n to sort out is when should IDs be expected to be assigned and available d=
uring pipeline construction.</div><div dir=3D"auto"><br><div class=3D"gmail=
_quote" dir=3D"auto"><div dir=3D"ltr" class=3D"gmail_attr">On Wed, Jul 1, 2=
020, 6:34 PM Luke Cwik &lt;<a href=3D"mailto:lcwik@google.com">lcwik@google=
.com</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"mar=
gin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir=3D"ltr=
">It seems like we keep running into translation issues with XLang due to h=
ow it is represented in the SDK. (e.g. Brian&#39;s work on context map due =
to loss of coder ids, Heejong&#39;s work related to missing environment ids=
 on windowing strategies).<div><br></div><div>I understand that there is an=
 effort that is Dataflow specific where the conversion of the Beam proto -&=
gt; Dataflow API (v1b3) will help with some issues but it=C2=A0still requir=
es the SDK pipeline representation -&gt; Beam proto to occur correctly whic=
h won&#39;t be fixed by the Dataflow specific effort.</div><div><br></div><=
div><div>Why did we go with the current approach?</div><div><br></div><div>=
</div></div><div>What other ways could we do this?</div></div>
</blockquote></div></div></div>

--00000000000094828c05a96c0529--