hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carter Shanklin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16258) Suggesting a non-standard extension to MERGE
Date Mon, 20 Mar 2017 20:13:42 GMT
Carter Shanklin created HIVE-16258:
--------------------------------------

             Summary: Suggesting a non-standard extension to MERGE
                 Key: HIVE-16258
                 URL: https://issues.apache.org/jira/browse/HIVE-16258
             Project: Hive
          Issue Type: Improvement
            Reporter: Carter Shanklin


Some common data maintenance strategies, especially the Type 2 SCD update, would become substantially
easier with a small extension to the SQL standard for MERGE, specifically the ability to say
"when matched then insert". Per the standard, matched records can only be updated or deleted.

In the Type 2 SCD, when a new record comes in you update the old version of the record and
insert the new version of the same record. If this extension were supported, sample Type 2
SCD code would look as follows:

{code}
merge into customer
using new_customer_stage stage
on stage.source_pk = customer.source_pk
when not matched then insert values    /* Insert a net new record */
  (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null)
when matched then update set           /* Update an old record to mark it as out-of-date */
  is_current = false, end_date = current_date()
when matched then insert values        /* Insert a new current record */
  (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null);
{code}

Without this support, the user needs to devise some sort of workaround. A common approach
is to first left join the staging table against the table to be updated, then to join these
results to a helper table that will spit out two records for each match and one record for
each miss. One of the matching records needs to have a join key that can never occur in the
source data so this requires precise knowledge of the source dataset.

An example of this:
{code}
merge into customer
using (
  select
    *,
    coalesce(invalid_key, source_pk) as join_key
  from (
    select
      stage.source_pk, stage.name, stage.state,
      case when customer.source_pk is null then 1
      when stage.name <> customer.name or stage.state <> customer.state then 2
      else 0 end as scd_row_type
    from
      new_customer_stage stage
    left join
      customer
    on (stage.source_pk = customer.source_pk and customer.is_current = true)
  ) updates
  join scd_types on scd_types.type = scd_row_type
) sub
on sub.join_key = customer.source_pk
when matched then update set
  is_current = false,
  end_date = current_date()
when not matched then insert values
  (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, null);
select * from customer order by source_pk;
{code}

This code is very complicated and will fail if the "invalid" key ever shows up in the source
dataset. This simple extension provides a lot of value and likely very little maintenance
overhead.

/cc [~ekoifman]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message