beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tyler Akidau <taki...@google.com>
Subject Re: SQL in Stream Computing: MERGE or INSERT?
Date Thu, 22 Jun 2017 15:35:00 GMT
Calcite appears to have UPSERT
<https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
use that instead?

Also, I don't understand your statement that "INSERT will behave
differently in batch & stream processing". Can you explain further?

-Tyler


On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <jesse@bigdatainstitute.io>
wrote:

> If I'm understanding correctly, Hive does that with a insert into followed
> by a select statement that does the aggregation.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>
> On Thu, Jun 22, 2017 at 1:32 AM James <xumingmingv@gmail.com> wrote:
>
>> Hi team,
>>
>> I am thinking about a SQL and stream computing related problem, want to
>> hear your opinions.
>>
>> In stream computing, there is a typical case like this:
>>
>> *We want to calculate a big wide result table, which has one rowkey and
>> ten
>> value columns:*
>> *create table result (*
>> *    rowkey varchar(127) PRIMARY KEY,*
>> *    col1 int,*
>> *    col2 int,*
>> *    ...*
>> *    col10 int*
>> *);*
>>
>> Each of the value columns is calculated by a complex query, so there will
>> be ten SQLs to calculate
>> data for this table, for each sql:
>>
>> * First check whether there is a row for the specified `rowkey`.
>> * If yes, then `update`, otherwise `insert`.
>>
>> There is actually a dedicated sql syntax called `MERGE` designed for
>> this(SQL2008), a sample usage is:
>>
>> MERGE INTO result D
>>    USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>    ON (D.rowkey = S.rowkey)
>>    WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>    WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>
>>
>> *The semantic fits perfectly, but it is very verbose, and normal users
>> rarely used this syntax.*
>>
>> So my colleagues invented a new syntax for this scenario (Or more
>> precisely, a new interpretation for the INSERT statement). For the above
>> scenario, user will always write `insert` statement:
>>
>> insert into result(rowkey, col1) values(...)
>> insert into result(rowkey, col2) values(...)
>>
>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>> exists, then update, otherwise `insert`. This solution is very concise,
>> but
>> violates the semantics of `insert`, using this solution INSERT will behave
>> differently in batch & stream processing.
>>
>> How do you guys think? which do you prefer? What's your reasoning?
>>
>> Looking forward to your opinions, thanks in advance.
>>
> --
> Thanks,
>
> Jesse
>

Mime
View raw message