From: Alan Gates
Subject: Re: Adding update/delete to the hive-hcatalog-streaming API
Date: Thu, 26 Mar 2015 21:48:03 GMT
The missing piece for adding update and delete to the streaming API is a
primary key.  Updates and deletes in SQL work by scanning the table or
partition where the record resides.  This is assumed to be OK since we
are not supporting transactional workloads, and thus updates and deletes
are assumed to be infrequent.  But having to scan the table or partition
for each update or delete will not perform adequately in the streaming
case.

I've had a few discussions with others recently who are thinking of
adding merge-like functionality, where you would upload all changes to a
temporary table and then apply those changes in a single scan and
transaction.  This is a common way to handle these situations for data
warehouses, and is much easier than adding a primary key concept to Hive.
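
To make the merge idea concrete, here is a minimal sketch in plain
Python, not Hive code.  It assumes the changes have been staged as
(operation, row) pairs and that rows can be matched on an
application-level column, here the hypothetical "id"; the name
merge_changes is made up for illustration.  The match column is only a
join key in the data, not a primary key enforced by Hive.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]
# A staged change is (operation, row); operation is "insert", "update" or "delete".
Change = Tuple[str, Row]


def merge_changes(base_rows: Iterable[Row],
                  changes: Iterable[Change],
                  key: str = "id") -> Iterator[Row]:
    """Apply every staged change to base_rows in one pass over base_rows."""
    # Index the staged changes once; this plays the role of the temp table.
    updates: Dict[object, Row] = {}
    deletes = set()
    inserts = []
    for op, row in changes:
        if op == "update":
            updates[row[key]] = row
        elif op == "delete":
            deletes.add(row[key])
        elif op == "insert":
            inserts.append(row)
        else:
            raise ValueError("unknown operation: %r" % (op,))

    # Single scan of the existing data: drop deleted rows, swap in updated ones.
    for row in base_rows:
        k = row[key]
        if k in deletes:
            continue
        yield updates.get(k, row)

    # New rows are appended after the scan.
    for row in inserts:
        yield row


base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
staged = [("update", {"id": 2, "v": "B"}),
          ("delete", {"id": 3}),
          ("insert", {"id": 4, "v": "d"})]
merged = list(merge_changes(base, staged))
```

The existing rows are read once no matter how many changes are staged,
whereas applying each update or delete on its own would mean one scan
per change.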


> Elliot West
> March 26, 2015 at 14:08
> Hi,
> I'd like to ascertain if it might be possible to add 'update' and 
> 'delete' operations to the hive-hcatalog-streaming API. I've been 
> looking at the API with interest for the last week as it appears to 
> have the potential to help with some general data processing patterns 
> that are prevalent where I work. Ultimately, we continuously load 
> large amounts of data into Hadoop which is partitioned by some time 
> interval - usually hour, day, or month depending on the data size. 
> However, the records that reside in this data can change. We often 
> receive some new information that mutates part of an existing record 
> already stored in a partition in HDFS. Typically the number of
> mutations is very small compared to the number of records in each
> partition.
> To handle this currently we re-read and re-write all partitions that 
> could potentially be affected by new data. In practice a single hour's 
> worth of new data can require the reading and writing of 1 month's 
> worth of partitions. By storing the data in a transactional Hive table 
> I believe that we can instead issue updates and deletes for only the 
> affected rows. Although we do use Hive for analytics on this data, 
> much of the processing that generates and consumes the data is 
> performed using Cascading. Therefore I'd like to be able to read and 
> write the data via an API which we'd aim to integrate into a Cascading 
> Tap of some description. Our Cascading processes could determine the 
> new, updated, and deleted records and then use the API to stream these 
> changes to the transactional Hive table.
> We have most of this working in a proof of concept, but as 
> hive-hcatalog-streaming does not expose the delete/update methods of 
> the OrcRecordUpdater we've had to hack together something unpleasant 
> based on the original API.
> As a first step I'd like to check if there is any appetite for adding 
> such functionality to the API or if this goes against the original 
> motivations of the project? If this suggestion sounds reasonable then 
> I'd be keen to help move this forward.
> Thanks - Elliot.

