hive-user mailing list archives

From "Mich Talebzadeh" <>
Subject RE: Adding update/delete to the hive-hcatalog-streaming API
Date Thu, 26 Mar 2015 22:05:27 GMT


The primary key is required for updates/deletes to uniquely identify the record that needs
to be updated; otherwise you are going to get unpredictable results, as you may update or
delete too many rows.


This is indeed a requirement for real-time delivery of data to data warehouses (DW) such as
Sybase IQ using a replication server. The replicate database (in this case the DW) does not
need to have that primary key; however, the replication definition for the end table will
have to include a primary key or unique index.


If the idea is to get data from an RDBMS (read: an ACID-compliant database) to Hive in real
time, I would prefer this unique key to be built into the replication definition. Hive does
not have to have a unique key, as any data getting to Hive from an RDBMS is expected to be
transactional, and there is no need to do a sanity check in Hive. What you are mentioning as
a temp table is (if I gather correctly) what is referred to as staging in a DW. However,
there are now requirements for a DW to receive replicated data from transactional databases
through SAP Replication Server or Oracle GoldenGate.




Mich Talebzadeh




From: Alan Gates [] 
Sent: 26 March 2015 21:48
Subject: Re: Adding update/delete to the hive-hcatalog-streaming API


The missing piece for adding update and delete to the streaming API is a primary key.  Updates
and deletes in SQL work by scanning the table or partition where the record resides.  This
is assumed to be acceptable since we are not supporting transactional workloads, and thus
updates and deletes are assumed to be infrequent.  But a need to scan for each update or
delete will not perform adequately in the streaming case.

I've had a few discussions with others recently who are thinking of adding merge-like
functionality, where you would upload all changes to a temp table and then, in one
scan/transaction, apply those changes.  This is a common way to handle these situations in
data warehouses, and is much easier than adding a primary key concept to Hive.
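A minimal sketch of that staging-table pattern, using Hive's JDBC driver; the connection URL, table names, and column names here are assumptions, and the DELETE with an IN subquery requires a Hive version and ACID configuration that support it (older versions would need a join-based INSERT OVERWRITE rewrite instead):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingMergeSketch {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; the target table must be ORC,
        // bucketed, and created with transactional properties enabled.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        try (Statement stmt = conn.createStatement()) {
            // 1. Land the batch of changes in a staging (temp) table.
            stmt.execute("LOAD DATA INPATH '/tmp/changes'"
                    + " INTO TABLE staging_changes");
            // 2. In one pass, remove every row that was updated or
            //    deleted upstream (hypothetical schema: id, op, msg).
            stmt.execute("DELETE FROM target WHERE id IN"
                    + " (SELECT id FROM staging_changes)");
            // 3. Re-insert the new versions of updated rows plus any
            //    genuinely new rows; skip upstream deletes.
            stmt.execute("INSERT INTO TABLE target"
                    + " SELECT id, msg FROM staging_changes"
                    + " WHERE op != 'D'");
        } finally {
            conn.close();
        }
    }
}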


On March 26, 2015 at 14:08, Elliot West wrote:



I'd like to ascertain if it might be possible to add 'update' and 'delete' operations to the
hive-hcatalog-streaming API. I've been looking at the API with interest for the last week
as it appears to have the potential to help with some general data processing patterns that
are prevalent where I work. Ultimately, we continuously load large amounts of data into
Hadoop, partitioned by some time interval - usually hour, day, or month depending on the
data size. However, the records that reside in this data can change. We often receive new
information that mutates part of an existing record already stored in a partition in HDFS.
Typically the number of mutations is very small compared to the number of records in each
partition.

To handle this currently we re-read and re-write all partitions that could potentially be
affected by new data. In practice a single hour's worth of new data can require the reading
and writing of 1 month's worth of partitions. By storing the data in a transactional Hive
table I believe that we can instead issue updates and deletes for only the affected rows.
Although we do use Hive for analytics on this data, much of the processing that generates
and consumes the data is performed using Cascading. Therefore I'd like to be able to read
and write the data via an API which we'd aim to integrate into a Cascading Tap of some description.
Our Cascading processes could determine the new, updated, and deleted records and then use
the API to stream these changes to the transactional Hive table.
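As an illustration only (the class and field names below are hypothetical, not part of Cascading or the streaming API), the changes such a process computes might be modelled as simple operation-tagged records:

// Hypothetical change-record model; nothing below exists in
// hive-hcatalog-streaming or Cascading.
public class RowChange {
    public enum Op { INSERT, UPDATE, DELETE }

    private final Op op;
    private final byte[] record; // row serialized in the table's format

    public RowChange(Op op, byte[] record) {
        this.op = op;
        this.record = record;
    }

    public Op getOp() { return op; }
    public byte[] getRecord() { return record; }
}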


We have most of this working in a proof of concept, but as hive-hcatalog-streaming does not
expose the delete/update methods of the OrcRecordUpdater we've had to hack together something
unpleasant based on the original API.
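For context, the insert-only path through the current API looks roughly like this (the metastore URI, table, partition, and column names are assumptions); the update/delete calls at the end are hypothetical - they are what this proposal would add, not methods that exist today:

import java.util.Arrays;

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // Target must be an ORC, bucketed table with ACID enabled.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore:9083", "default", "target",
                Arrays.asList("2015-03-26"));
        StreamingConnection connection = endPoint.newConnection(true);
        DelimitedInputWriter writer = new DelimitedInputWriter(
                new String[] {"id", "msg"}, ",", endPoint);

        TransactionBatch batch = connection.fetchTransactionBatch(10, writer);
        batch.beginNextTransaction();
        batch.write("1,hello".getBytes());  // insert: supported today

        // Hypothetical additions this thread is proposing; these do
        // NOT exist in the current TransactionBatch interface:
        // batch.update("1,hello again".getBytes());
        // batch.delete("1".getBytes());

        batch.commit();
        batch.close();
        connection.close();
    }
}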


As a first step I'd like to check whether there is any appetite for adding such functionality
to the API, or whether this goes against the original motivations of the project. If this
suggestion sounds reasonable then I'd be keen to help move this forward.


Thanks - Elliot.

