beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reuven Lax <re...@google.com>
Subject Re: Pubsub to Beam SQL
Date Thu, 03 May 2018 19:59:30 GMT
I believe PubSubIO already exposes the publish timestamp if no timestamp
attribute is set.

On Thu, May 3, 2018 at 12:52 PM Anton Kedin <kedin@google.com> wrote:

> A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
> will probably need to a way to expose a message publish timestamp if we
> want to use it as an event timestamp, but that will be consumed by the same
> wrapper/transform without adding anything schema or SQL-specific to
> PubsubIO itself.
>
> On Thu, May 3, 2018 at 11:44 AM Reuven Lax <relax@google.com> wrote:
>
>> Are you planning on integrating this directly into PubSubIO, or add a
>> follow-on transform?
>>
>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <kedin@google.com> wrote:
>>
>>> Hi
>>>
>>> I am working on adding functionality to support querying Pubsub messages
>>> directly from Beam SQL.
>>>
>>> *Goal*
>>>   Provide Beam users a pure  SQL solution to create the pipelines with
>>> Pubsub as a data source, without the need to set up the pipelines in
>>> Java before applying the query.
>>>
>>> *High level approach*
>>>
>>>    -
>>>    - Build on top of PubsubIO;
>>>    - Pubsub source will be declared using CREATE TABLE DDL statement:
>>>       - Beam SQL already supports declaring sources like Kafka and Text
>>>       using CREATE TABLE DDL;
>>>       - it supports additional configuration using TBLPROPERTIES
>>>       clause. Currently it takes a text blob, where we can put a JSON
>>>       configuration;
>>>       - wrapping PubsubIO into a similar source looks feasible;
>>>    - The plan is to initially support messages only with JSON payload:
>>>    -
>>>       - more payload formats can be added later;
>>>    - Messages will be fully described in the CREATE TABLE statements:
>>>       - event timestamps. Source of the timestamp is configurable. It
>>>       is required by Beam SQL to have an explicit timestamp column for windowing
>>>       support;
>>>       - messages attributes map;
>>>       - JSON payload schema;
>>>    - Event timestamps will be taken either from publish time or
>>>    user-specified message attribute (configurable);
>>>
>>> Thoughts, ideas, comments?
>>>
>>> More details are in the doc here:
>>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>>
>>>
>>> Thank you,
>>> Anton
>>>
>>

Mime
View raw message