From dev-return-18929-archive-asf-public=cust-asf.ponee.io@beam.apache.org Wed Sep 18 18:18:24 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id E6A16180652 for ; Wed, 18 Sep 2019 20:18:23 +0200 (CEST) Received: (qmail 64169 invoked by uid 500); 18 Sep 2019 18:18:23 -0000 Mailing-List: contact dev-help@beam.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@beam.apache.org Delivered-To: mailing list dev@beam.apache.org Received: (qmail 64106 invoked by uid 99); 18 Sep 2019 18:18:22 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Sep 2019 18:18:22 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 52ABB1A4DC2 for ; Wed, 18 Sep 2019 18:18:22 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.8 X-Spam-Level: * X-Spam-Status: No, score=1.8 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-he-de.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id aqoKFqqrqaZa for ; Wed, 18 Sep 2019 18:18:20 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2607:f8b0:4864:20::d42; helo=mail-io1-xd42.google.com; envelope-from=xinyuliu.us@gmail.com; receiver= Received: from mail-io1-xd42.google.com (mail-io1-xd42.google.com [IPv6:2607:f8b0:4864:20::d42]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 6AE717DC5D for ; Wed, 18 Sep 2019 18:18:19 +0000 (UTC) Received: by mail-io1-xd42.google.com with SMTP id f12so1328258iog.12 for ; Wed, 18 Sep 2019 11:18:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=WZ5hTjuYkYVA9WnGjFksXm0afHn7DytAEqhqG/42uCw=; b=oNlBBg7850q87M8arOog6oQyI4nsGrGsouliNNa1HFXDtVtOlro2+YgWAtXb15l9ii mamYR6rgIi/WqUhYy3qj53/8rkqsSE9TaBoz4Jn8/fsVtxdr0e4YZROOGWaasMMyggZN 0o1Y9IH9+o506vRtbzMNiYyaFNprZCm3Cn9MRV698QIJ0hs3HLN7gFA53qRw1GfpgfHi Lk0CtnP2/Lu8SoMuMYv7EvhUwEx157M64NQ8tW/T2gOv03wmn94tl1BhsTpgMV2040Ub 7D7PcYSgq7rmaecj7w715eeNnaiemJf+Y8NsjLFIba7f1cLYajoKcQi2mBcpzBR5LI7d DjOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=WZ5hTjuYkYVA9WnGjFksXm0afHn7DytAEqhqG/42uCw=; b=tVRSk6qEGOF17MFlQgEoXALyb1LsvgwHrvVOefPb1JRa/aQMZLEcmb9PzEWJvPqvhd 8j8s2np0z2x1gbFC1RvD2L6ovpajd5j1P8mJcJBVJPsEwDNe060oVQ6MQwrf+IHhvPbX 1oL7VWycXtjAnjAY3wBXP2ZlchMR6LsF5EHvLSAoRguIIfwy09J9yDgv9/chZSc7DsYX WpYk/4oRmHlNfv+FRQI8STUTIxT78QCmjCOB7qwzoo5cMkPWnuuNe5huzp61DOxmV+Ny 3lganZXPvS/9C/JohvNrc/WEX78ONq9yQ0iYinfezmXfb2wdujyW+e0h0bbnJEjb3Rbh seYA== X-Gm-Message-State: APjAAAVNe5LfLVKiG4pnaGFlPXA9o1UsSHeEIKwPn+eOvLOYNgkjUSuG jMvuMa7ruY9JftfEnsrnoqyXCGWuDp6ac6d6cY4BEg== X-Google-Smtp-Source: APXvYqxJNrEmS6mhv3MQpGvXSLmcS6WoqZ+d0g9+mbn02IAVIMCOClh+4mBHjPTVtMmI9DoqR4TG4HnFES2AqSbE5LU= X-Received: by 2002:a6b:c302:: with SMTP id t2mr6381379iof.217.1568830698032; Wed, 18 Sep 2019 11:18:18 -0700 (PDT) MIME-Version: 1.0 References: <6e896d6c0c263c0185a990e7ac299bdb5600e840.camel@apache.org> In-Reply-To: From: Xinyu Liu Date: Wed, 18 Sep 2019 11:18:07 -0700 Message-ID: Subject: Re: Pointers on Contributing to Structured Streaming Spark Runner To: dev@beam.apache.org Cc: rahul patwari Content-Type: multipart/alternative; boundary="000000000000781a4f0592d7dc92" --000000000000781a4f0592d7dc92 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Alexey and Etienne: I'm very happy to join the sync-up meeting. Please forward the meeting info to me. I am based in California, US and hopefully the time will work :). Thanks, Xinyu On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot wrote: > Hi Xinyu, > > Thanks for offering help ! My comments are inline: > > Le vendredi 13 septembre 2019 =C3=A0 12:16 -0700, Xinyu Liu a =C3=A9crit = : > > Hi, Etienne, > > The slides are very informative! Thanks for sharing the details about how > the Beam API are mapped into Spark Structural Streaming. > > > Thanks ! > > We (LinkedIn) are also interested in trying the new SparkRunner to run > Beam pipeine in batch, and contribute to it too. From my understanding, > seems the functionality on batch side is mostly complete and covers quite= a > large percentage of the tests (a few missing pieces like state and timer = in > ParDo and SDF). > > > Correct, it passes 89% of the tests, but there is more than SDF, state an= d > timer missing, there is also ongoing encoders work that I would like to > commit/push before merging. > > If so, is it possible to merge the new runner sooner into master so it's > much easier for us to pull it in (we have an internal fork) and contribut= e > back? > > > Sure, see my other mail on this thread. As Alexey mentioned, please join > the sync meeting we have, the more the merrier ! > > > Also curious about the scheme part in the runner. Seems we can leverage > the schema-aware work in PCollection and translate from Beam schema to > Spark, so it can be optimized in the planner layer. It will be great to > hear back your plans on that. > > > Well, it is not designed yet but, if you remember my talk, we need to > store beam windowing information with the data itself, so ending up havin= g > a dataset . One lead that was discussed is to store it as = a > Spark schema such as this: > > 1. field1: binary data for beam windowing information (cannot be mapped t= o > fields because beam windowing info is complex structure) > > 2. fields of data as defined in the Beam schema if there is one > > > Congrats on this great work! > > Thanks ! > > Best, > > Etienne > > Thanks, > Xinyu > > On Wed, Sep 11, 2019 at 6:02 PM Rui Wang wrote: > > Hello Etienne, > > Your slide mentioned that streaming mode development is blocked because > Spark lacks supporting multiple-aggregations in its streaming mode but > design is ongoing. Do you have a link or something else to their design > discussion/doc? > > > -Rui > > On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot > wrote: > > Hi Rahul, > Sure, and great ! Thanks for proposing ! > If you want details, here is the presentation I did 30 mins ago at the > apachecon. You will find the video on youtube shortly but in the meantime= , > here is my presentation slides. > > And here is the structured streaming branch. I'll be happy to review your > PRs, thanks ! > > > https://github.com/apache/beam/tree/spark-runner_structured-streaming > > Best > Etienne > > Le mercredi 11 septembre 2019 =C3=A0 16:37 +0530, rahul patwari a =C3=A9c= rit : > > Hi Etienne, > > I came to know about the work going on in Structured Streaming Spark > Runner from Apache Beam Wiki - Works in Progress. > I have contributed to BeamSql earlier. And I am working on supporting > PCollectionView in BeamSql. > > I would love to understand the Runner's side of Apache Beam and > contribute to the Structured Streaming Spark Runner. > > Can you please point me in the right direction? > > Thanks, > Rahul > > --000000000000781a4f0592d7dc92 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Alexey and Etienne: I'm very happy to join the sync-up= meeting. Please forward the meeting info to me. I am based in California, = US and hopefully the time will work :).

Thanks,
Xinyu=

On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot <echauchot@apache.org> wrote:
Hi Xinyu,

Thanks for offering h= elp ! My comments are inline:

Le vendredi 13 septe= mbre 2019 =C3=A0 12:16 -0700, Xinyu Liu a =C3=A9crit=C2=A0:
Hi, Etienne,

The slides are very informative! Thanks for sharing the details about= =C2=A0how the Beam API are mapped into Spark Structural Streaming.

Thanks !

We (LinkedIn) are al= so interested in trying the new SparkRunner to run Beam pipeine in batch, a= nd contribute to it too. From my understanding, seems the functionality on = batch side is mostly complete and covers quite a large percentage of the te= sts (a few missing pieces like state and timer in ParDo and SDF).

Correct, it passes 89% of the tests, bu= t there is more than SDF, state and timer missing, there is also ongoing en= coders work that I would like to commit/push before merging.

=
If so= , is it possible to merge the new runner sooner into master so it's muc= h easier for us to pull it in (we have an internal fork) and contribute bac= k?

Sure, see my other mail on t= his thread. As Alexey mentioned, please join the sync meeting we have, the = more the merrier !


Also curious about the scheme pa= rt in the runner. Seems we can leverage the schema-aware work in PCollectio= n and translate from Beam schema to Spark, so it can be optimized in the pl= anner layer. It will be great to hear back your plans on that.
<= /blockquote>

Well, it is not designed yet but, if you re= member my talk, we need to store beam windowing information with the data i= tself, so ending up having a dataset<WindowedValue> . One lead that w= as discussed is to store it as a Spark schema such as this:

<= /div>
1. field1: binary data for beam windowing information (cannot be = mapped to fields because beam windowing info is complex structure)

2. fields of data as defined in the Beam schema if there = is one


Congrats on this great work!
Thanks !

Best,

Etienne

Thanks,
Xinyu

On Wed, Sep 11, 2019 at 6:02 P= M Rui Wang <ruwan= g@google.com> wrote:
Hello Etienne,

Your slide mentioned= that streaming mode development is blocked because Spark lacks supporting = multiple-aggregations in its streaming mode but design is ongoing. Do you h= ave a link or something else to their design discussion/doc?

=

-Rui=C2=A0=C2=A0

On Wed, Sep 11, 2019 at 5:1= 0 PM Etienne Chauchot <echauchot@apache.org> wrote:
Hi R= ahul,
Sure, and great ! Thanks for proposing !
If you w= ant details, here is the presentation I did 30 mins ago at the apachecon. Y= ou will find the video on youtube shortly but in the meantime, here is my p= resentation slides.

And here is the structured str= eaming branch. I'll be happy to review your PRs, thanks !

Best
Etienne

Le mercredi= 11 septembre 2019 =C3=A0 16:37 +0530, rahul patwari a =C3=A9crit=C2=A0:
Hi Etienne= ,

I came to know about the work going on in Structured= Streaming Spark Runner from Apache Beam Wiki - Works in Progress.
= I have contributed to BeamSql earlier. And I am working on supporting PColl= ectionView in BeamSql.

I would love to understand the = Runner's side of Apache Beam and contribute=C2=A0to the Structured Stre= aming Spark Runner.

Can you please point me in the rig= ht direction?

Thanks,
Rahul
--000000000000781a4f0592d7dc92--