hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: on duplicate update equivalent?
Date Fri, 23 Sep 2016 18:36:33 GMT
> Dimensions change, and I'd rather do update than recreate a snapshot.

Slowly changing dimensions are the common use-case for Hive's ACID MERGE.

The feature you need is most likely covered by 

https://issues.apache.org/jira/browse/HIVE-10924

The 2nd comment on that JIRA describes the use-case:

"Once an hour, a set of inserts and updates (up to 500k rows) for various dimension tables
(eg. customer, inventory, stores) needs to be processed. The dimension tables have primary
keys and are typically bucketed and sorted on those keys."

Any other approach would need a full snapshot re-materialization. ACID avoids that because
it can generate DELETE + INSERT deltas instead of rewriting the original file for a 2% upsert.
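
To put rough numbers on that, here is a toy Python model of the two approaches. It is only
a sketch: the function names, the dict-based "table" and the row counts are all made up for
illustration, and none of it is Hive's actual storage format or API. It just counts how many
rows each approach has to write for a 2% upsert.

```python
# Toy model only: plain dicts stand in for table files. Not Hive code.

def snapshot_rewrite(base, changes):
    """Re-materialize the whole table; returns (new_table, rows_written)."""
    merged = dict(base)
    merged.update(changes)
    return merged, len(merged)

def acid_style_upsert(base, changes):
    """Leave the base untouched; record one delete per existing key that
    changed, plus one insert per changed row."""
    deletes = {key for key in changes if key in base}
    inserts = dict(changes)
    return deletes, inserts, len(deletes) + len(inserts)

def read_merged(base, deletes, inserts):
    """Merge-on-read: base rows minus deleted keys, plus inserted rows."""
    visible = {k: v for k, v in base.items() if k not in deletes}
    visible.update(inserts)
    return visible

# Hypothetical dimension table of 100,000 rows keyed by primary key.
base = {i: f"customer-{i}" for i in range(100_000)}

# A 2% upsert: 1,900 updates to existing keys and 100 brand-new keys.
changes = {i: f"customer-{i}-v2" for i in range(1_900)}
changes.update({i: f"customer-{i}" for i in range(100_000, 100_100)})

snapshot, snapshot_rows = snapshot_rewrite(base, changes)
deletes, inserts, delta_rows = acid_style_upsert(base, changes)

# Both approaches expose the same table contents to a reader.
assert read_merged(base, deletes, inserts) == snapshot

print(snapshot_rows, delta_rows)  # 100100 3900
```

In this model the snapshot approach writes every row again, while the delete + insert
approach writes only rows proportional to the size of the change and pays for it with a
merge at read time.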

If you do not have any isolation concerns (as in, a query reading while only 50% of your
update has been applied), using HBase-backed dimension tables in Hive is possible, but it
does not offer the same transactional consistency that the ACID merge will.

Cheers,
Gopal


