hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: on duplicate update equivalent?
Date Fri, 23 Sep 2016 19:00:12 GMT
The fundamental question is: do you need these recurring updates to
dimension tables throttling your Hive tables.

Besides why bother with ETL when one can do ELT.

For dimension table just add two additional columns namely

   , op_type int
   , op_time timestamp

op_type = 1/2/3 (INSERT/UPDATE/DELETE)  and op_time = timestamp from Hive
to the original table. New records will be appended to the dimension table.
So when you have the full Entity Life History (one INSERT, multiple
UPDATES and one delete) for a given primary key. then you can do whatever
you want plus of course full audit of every record (for example what
happened to every trade, who changed what etc).

In your join with the FACT table you will need to use analytics to find the
last entry for a given primary key (ignoring deletes) or just use standard
HQL.

If you are going to bring in Hbase etc to it, then Spark solution that I
suggested earlier on may serve better.

HTH





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 September 2016 at 19:36, Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

> > Dimensions change, and I'd rather do update than recreate a snapshot.
>
> Slow changing dimensions are the common use-case for Hive's ACID MERGE.
>
> The feature you need is most likely covered by
>
> https://issues.apache.org/jira/browse/HIVE-10924
>
> 2nd comment from that JIRA
>
> "Once an hour, a set of inserts and updates (up to 500k rows) for various
> dimension tables (eg. customer, inventory, stores) needs to be processed.
> The dimension tables have primary keys and are typically bucketed and
> sorted on those keys."
>
> Any other approach would need a full snapshot re-materialization, because
> ACID can generate DELETE + INSERT instead of rewriting the original file
> for a 2% upsert.
>
> If you do not have any isolation concerns (as in, a query doing a read
> when 50% of your update has applied), using HBase backed dimension tables
> in Hive is possible, but it does not offer the same transactional
> consistency as the ACID merge will.
>
> Cheers,
> Gopal
>
>
>

Mime
View raw message