iceberg-issues mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [iceberg] openinx commented on issue #360: Spec: Add column equality delete files
Date Tue, 30 Jun 2020 04:47:30 GMT

openinx commented on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-651531457


   > I have been thinking of an upsert case, where we get an entire replacement row but
not the previous row values. 
   
   An `upsert` without the previous row values depends on a primary key specification; otherwise an `UPSERT(1,2)` cannot know which row it should change. A primary key also introduces other questions:
   1. Should the primary key columns include the partition and bucket columns? If not, then for a table with columns (a, b), where a is the primary key and b is the partition key, the old row for an `UPSERT(1,2)` could be in any partition of the iceberg table, so we would need to broadcast the delete to every partition. That seems wasteful of resources. If they must, it limits usage: for example, we may adjust the bucket policy based on the fields to be JOINed between two tables to avoid a massive data shuffle, and adjusting the bucket policy would then also require adjusting the primary key to include the new bucket columns.
   2. How do we ensure uniqueness in an iceberg table? Within a given bucket, two different versions of the same row may appear in two different data files, so we would need a costly `JOIN` between data files (previously we only needed to JOIN data files against delete files). Within a given iceberg table, if the primary key does not include the partition column, two rows with the same primary key may appear in two different partitions, so deduplication by primary key is a problem too (see the sketch after this list).
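   To make point 2 concrete, here is a minimal sketch (plain Java, not the Iceberg API; `Row`, `primaryKey` and `sequenceNumber` are hypothetical names) of the read-side deduplication that every scan would need if two versions of the same key can sit in different data files:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row holder: the primary key value, the sequence number of the
// data file that wrote it, and the remaining column values.
record Row(Object primaryKey, long sequenceNumber, Object[] values) {}

class PrimaryKeyDedup {
  // Keep only the newest version of each primary key across all data files.
  // Without a uniqueness guarantee, every scan has to do this extra pass
  // (a "join" between data files) instead of only joining data files
  // against delete files.
  static Map<Object, Row> latestByKey(List<Row> rowsFromAllDataFiles) {
    Map<Object, Row> latest = new HashMap<>();
    for (Row row : rowsFromAllDataFiles) {
      latest.merge(row.primaryKey(), row,
          (old, cur) -> cur.sequenceNumber() > old.sequenceNumber() ? cur : old);
    }
    return latest;
  }
}
```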
   
   In my opinion, CDC events from an RDBMS should always provide both old and new values (such as a `row-level` binlog). It would be good to handle that case correctly first. The other CDC events, such as an `UPSERT` carrying only the primary key, seem to need more consideration.
   
   > Using a unique ID for each CDC event could mitigate the problem, but duplicate events
from at-least-once processing would still incorrectly delete data rows. How would you avoid
this problem?
   
   Spark streaming only guarantees `at-least-once` semantics, so it seems hard to keep the sink table consistent unless we provide a global timestamp (or version) that identifies the before-after order; a rough sketch of that idea follows below. Let me think more about this. Flink provides `exactly-once` semantics, so it seems easier there.
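   For the at-least-once case, here is a rough sketch of the "global timestamp" idea (plain Java; `version` is a hypothetical monotonically increasing value such as a binlog offset, not an existing Iceberg field): an upsert is applied only if its version is newer than what the sink already holds, so a replayed duplicate becomes a no-op instead of an incorrect delete.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upsert event carrying a monotonically increasing version
// (e.g. the binlog offset or a global event timestamp).
record Upsert(Object key, long version, Object[] newValues) {}

class IdempotentSink {
  private final Map<Object, Upsert> state = new HashMap<>();

  // Apply the event only if it is newer than what the sink already has.
  // Under at-least-once delivery, a replayed (duplicate or stale) event
  // compares as not-newer and is ignored, so it cannot delete or overwrite
  // a row that a later event has already produced.
  void apply(Upsert event) {
    state.merge(event.key(), event,
        (old, cur) -> cur.version() > old.version() ? cur : old);
  }
}
```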
   
   > but I'm not sure it would enable a source to replay exactly the events that were received
-- after all, we're still separating inserts and deletes into separate files so we can't easily
reconstruct an upsert event.
   
   I think the mixed equality-deletion/positional-deletion approach you described seems hard to use for reconstructing the correct event order. The pure positional-deletion approach described in this [document](https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.ljqc7bxmc6ej) could restore the original order, roughly as sketched below.
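   As a rough illustration of why pure positional deletes could preserve ordering (this is only my reading of the linked document, with hypothetical types): if the insert rows and the positional delete entries are both written in event order, and the i-th delete marks the row replaced by the i-th insert, zipping the two logs back together restores the original upsert sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical changelog entries: an inserted row at a position in a data
// file, and a positional delete pointing at the row it supersedes.
record Insert(String dataFile, long pos, Object[] row) {}
record PositionalDelete(String dataFile, long pos) {}
record UpsertEvent(PositionalDelete deleted, Insert inserted) {}

class ChangelogReplay {
  // Assuming both logs were written in event order and the i-th delete marks
  // the row replaced by the i-th insert, pairing them restores the original
  // sequence of upsert events.
  static List<UpsertEvent> replay(List<Insert> inserts, List<PositionalDelete> deletes) {
    List<UpsertEvent> events = new ArrayList<>();
    for (int i = 0; i < Math.min(inserts.size(), deletes.size()); i++) {
      events.add(new UpsertEvent(deletes.get(i), inserts.get(i)));
    }
    return events;
  }
}
```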
   
   >  I don't think that it makes a case for dynamic columns.
   
   Yes, I agree that we don't need to care about dynamic columns now. Real CDC events always provide all column values for a row. Please ignore the proposal about dynamic-column encoding.

