hive-user mailing list archives

From Alan Gates <alanfga...@gmail.com>
Subject Re: Adding update/delete to the hive-hcatalog-streaming API
Date Thu, 26 Mar 2015 23:20:29 GMT


> Elliot West <teabot@gmail.com>
> March 26, 2015 at 15:58
> Hi Alan,
>
> Yes, this is precisely our situation. The issues I'm having with the 
> current API are that I cannot intercept the creation of the 
> OrcRecordUpdater to set the recordIdColumn in 
> the AcidOutputFormat.Options instance. Additionally, I cannot extend 
> the TransactionBatch interface to expose further operations and 
> instead have to encode the operation type in the row parameter and 
> decode it in the RecordWriter implementation - possible but not very 
> elegant.
Yes, you'd have to make changes in the API itself to make this possible, 
which is fine.  I'm happy to review patches for this.
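
For what it's worth, one possible shape for such a change, purely as a sketch 
(neither the interface nor the method names below exist in the current API):

import org.apache.hive.hcatalog.streaming.StreamingException;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

// Hypothetical extension of TransactionBatch; names are illustrative only.
public interface MutatingTransactionBatch extends TransactionBatch {

  // Each record would need to carry (or be resolvable to) its record
  // identifier so that the underlying OrcRecordUpdater can locate the
  // original row.
  void update(byte[] record) throws StreamingException, InterruptedException;

  void delete(byte[] record) throws StreamingException, InterruptedException;
}

The implementation behind it would also need a way to pass recordIdColumn through 
to the AcidOutputFormat.Options instance when the OrcRecordUpdater is created, 
which is exactly the hook you're missing today.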
>
> Regarding the merge functionality, is this a new suggested feature of 
> Hive? Is there a JIRA I could track?
Not yet, it's just discussions I've had with users like you who'd be 
interested in merge, though from the SQL side rather than streaming.  I 
am not aware of anyone working on it at the moment.

Alan.
>
> Thanks - Elliot.
>
>
> On Thursday, March 26, 2015, Alan Gates <alanfgates@gmail.com> wrote:
> Alan Gates <alanfgates@gmail.com>
> March 26, 2015 at 15:30
> Are you saying that when the records arrive you don't know which are 
> updates and which are inserts, and that you're already doing processing 
> to determine that?  If so, this is exactly the case we'd like to hit 
> with the merge functionality.
>
> If you're already scanning the existing ORC file and obtaining the 
> unique identifier (which is the triple (rowid, txnid, bucketid)) and 
> determining which records have changed then you could certainly change 
> the streaming interface to enable passing the records through in bulk 
> (along with operation markers to indicate insert/update/delete) and 
> write those in a delta file in one pass.
>
> Alan.
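
To make the bulk idea above concrete, a mutation event could pair an operation 
marker with the row's identifier triple and the new row contents.  The class and 
enum below are made up purely for illustration:

import org.apache.hadoop.hive.ql.io.RecordIdentifier;

// Illustrative only: a bulk "mutation event" carrying an operation marker,
// the identifier of the row being touched, and the new row contents.
public class MutationEvent {

  public enum Operation { INSERT, UPDATE, DELETE }

  private final Operation operation;
  // RecordIdentifier holds the (transaction id, bucket id, row id) triple.
  private final RecordIdentifier recordId;
  private final byte[] row;

  public MutationEvent(Operation operation, RecordIdentifier recordId, byte[] row) {
    this.operation = operation;
    this.recordId = recordId;
    this.row = row;
  }

  public Operation getOperation() { return operation; }
  public RecordIdentifier getRecordId() { return recordId; }
  public byte[] getRow() { return row; }
}

A modified streaming RecordWriter could then dispatch on getOperation() rather 
than having the operation type smuggled inside the row bytes.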
>
> Elliot West <teabot@gmail.com>
> March 26, 2015 at 15:10
> Hi, thanks for your quick reply.
>
> I see your point, but in my case would I not have the required 
> RecordIdentifiers available as I'm already reading the entire 
> partition to determine which records have changed?  Admittedly Hive 
> will not reveal the ROW__IDs to me but I assume (incorrectly perhaps) 
> that these are in the underlying ORCFile and therefore can be read by 
> my Cascading process. In this scenario (where I have the relevant 
> recIds to hand) why would it then be necessary to perform additional 
> scans when issuing the deletes/updates?
>
> Apologies if I'm misunderstanding something.
>
> Thanks - Elliot.
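
For reference, the delta (and compacted base) files of a transactional table do 
carry a handful of synthetic columns alongside each row.  As a quick sketch, and 
assuming the column names used by OrcRecordUpdater (operation, 
originalTransaction, bucket, rowId, currentTransaction, row), they can be dumped 
with the plain ORC reader:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

// Sketch: dump every field of each row in an ACID delta/base ORC file,
// including the synthetic columns that make up the row identifier.
public class DumpAcidFile {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));
    StructObjectInspector inspector =
        (StructObjectInspector) reader.getObjectInspector();
    RecordReader rows = reader.rows();
    Object row = null;
    while (rows.hasNext()) {
      row = rows.next(row);
      for (StructField field : inspector.getAllStructFieldRefs()) {
        // Expect fields such as originalTransaction, bucket and rowId here.
        System.out.println(field.getFieldName() + " = "
            + inspector.getStructFieldData(row, field));
      }
      System.out.println();
    }
    rows.close();
  }
}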
>
> On Thursday, March 26, 2015, Alan Gates <alanfgates@gmail.com> wrote:
> Alan Gates <alanfgates@gmail.com>
> March 26, 2015 at 14:48
> The missing piece for adding update and delete to the streaming API is 
> a primary key.  Updates and deletes in SQL work by scanning the table 
> or partition where the record resides.  This is assumed to be acceptable 
> since we are not targeting OLTP-style workloads and updates and deletes 
> are expected to be infrequent.  But needing a scan for each update or 
> delete will not perform adequately in the streaming case.
>
> I've had a few discussions with others recently who are thinking of 
> adding merge-like functionality, where you would upload all changes to 
> a temp table and then in one scan/transaction apply those changes.  
> This is a common way to handle these situations for data warehouses, 
> and is much easier than adding a primary key concept to Hive.
>
> Alan.
>
> Elliot West <teabot@gmail.com>
> March 26, 2015 at 14:08
> Hi,
>
> I'd like to ascertain if it might be possible to add 'update' and 
> 'delete' operations to the hive-hcatalog-streaming API. I've been 
> looking at the API with interest for the last week as it appears to 
> have the potential to help with some general data processing patterns 
> that are prevalent where I work. Ultimately, we continuously load 
> large amounts of data into Hadoop, partitioned by some time 
> interval - usually hour, day, or month depending on the data size. 
> However, the records that reside in this data can change. We often 
> receive some new information that mutates part of an existing record 
> already stored in a partition in HDFS. Typically the number of 
> mutations is very small compared to the number of records in each 
> partition.
>
> To handle this currently we re-read and re-write all partitions that 
> could potentially be affected by new data. In practice a single hour's 
> worth of new data can require the reading and writing of 1 month's 
> worth of partitions. By storing the data in a transactional Hive table 
> I believe that we can instead issue updates and deletes for only the 
> affected rows. Although we do use Hive for analytics on this data, 
> much of the processing that generates and consumes the data is 
> performed using Cascading. Therefore I'd like to be able to read and 
> write the data via an API which we'd aim to integrate into a Cascading 
> Tap of some description. Our Cascading processes could determine the 
> new, updated, and deleted records and then use the API to stream these 
> changes to the transactional Hive table.
>
> We have most of this working in a proof of concept, but as 
> hive-hcatalog-streaming does not expose the delete/update methods of 
> the OrcRecordUpdater we've had to hack together something unpleasant 
> based on the original API.
>
> As a first step I'd like to check whether there is any appetite for adding 
> such functionality to the API, or whether this goes against the original 
> motivations of the project. If this suggestion sounds reasonable then 
> I'd be keen to help move this forward.
>
> Thanks - Elliot.
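
For context, the piece the streaming API currently keeps hidden is the 
RecordUpdater handed out by the ACID output format.  Very roughly, and only as a 
sketch (the Options calls, in particular recordIdColumn, should be checked 
against the actual AcidOutputFormat javadoc):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.AcidOutputFormat;
import org.apache.hadoop.hive.ql.io.RecordUpdater;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

// Sketch only: obtaining a RecordUpdater directly from the ACID output format,
// with the record id column configured so that updates/deletes can locate rows.
public class UpdaterSketch {
  public static RecordUpdater openUpdater(Configuration conf, Path partitionPath,
      ObjectInspector rowInspector, int bucket, long txnId) throws Exception {
    AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
        .inspector(rowInspector)
        .bucket(bucket)
        .minimumTransactionId(txnId)
        .maximumTransactionId(txnId)
        .recordIdColumn(0);   // assumed position of the ROW__ID field in the row
    return new OrcOutputFormat().getRecordUpdater(partitionPath, options);
  }
}

Exposing update and delete in hive-hcatalog-streaming essentially means giving 
callers a sanctioned path to this object instead of the workaround described above.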
