hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Elliot West <>
Subject Re: Adding update/delete to the hive-hcatalog-streaming API
Date Thu, 26 Mar 2015 22:58:24 GMT
Hi Alan,

Yes, this is precisely our situation. The issues I'm having with the
current API are that I cannot intercept the creation of the
OrcRecordUpdater to set the recordIdColumn in the AcidOutputFormat.Options
instance. Additionally, I cannot extend the TransactionBatch interface to
expose further operations and instead have to encode the operation type in
the row parameter and decode it in the RecordWriter implementation -
possible but not very elegant.

Regarding the merge functionality, is this a new suggested feature of Hive?
Is there a JIRA I could track?

Thanks - Elliot.

On Thursday, March 26, 2015, Alan Gates <> wrote:

> Are you saying that when the records arrive you don't know updates from
> inserts and you're already doing processing to determine that?  If so, this
> is exactly the case we'd like to hit with the merge functionality.
> If you're already scanning the existing ORC file and obtaining the unique
> identifier (which is the triple (rowid, txnid, bucketid)) and determining
> which records have changed then you could certainly change the streaming
> interface to enable passing the records through  in bulk (along with
> operation markers to indicate insert/update/delete) and write those in a
> delta file in one pass.
> Alan.
>   Elliot West
>  March 26, 2015 at 15:10
> Hi, thanks for your quick reply.
> I see your point, but in my case would I not have the required
> RecordIdentifiers available as I'm already reading the entire partition to
> determine which records have changed?  Admittedly Hive will not reveal
> the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
> underlying ORCFile and therefore can be read by my Cascading process. In
> this scenario (where I have the relevant recIds to hand) why would it then
> be necessary to perform additional scans when issuing the deletes/updates?
> Apologies if I'm misunderstanding something.
> Thanks - Elliot.
> On Thursday, March 26, 2015, Alan Gates <> wrote:
>   Alan Gates
>  March 26, 2015 at 14:48
>  The missing piece for adding update and delete to the streaming API is a
> primary key.  Updates and deletes in SQL work by scanning the table or
> partition where the record resides.  This is assumed to be ok since we are
> not supporting transactional workloads and thus update/deletes are assumed
> to be infrequent.  But a need to scan for each update or delete will not
> perform adequately in the streaming case.
> I've had a few discussions with others recently who are thinking of adding
> merge like functionality, where you would upload all changes to a temp
> table and then in one scan/transaction apply those changes.  This is a
> common way to handle these situations for data warehouses, and is much
> easier than adding a primary key concept to Hive.
> Alan.
>   Elliot West
>  March 26, 2015 at 14:08
> Hi,
> I'd like to ascertain if it might be possible to add 'update' and 'delete'
> operations to the hive-hcatalog-streaming API. I've been looking at the API
> with interest for the last week as it appears to have the potential to help
> with some general data processing patterns that are prevalent where I work.
> Ultimately, we continuously load large amounts of data into Hadoop which is
> partitioned by some time interval - usually hour, day, or month depending
> on the data size. However, the records that reside in this data can change.
> We often receive some new information that mutates part of an existing
> record already stored in a partition in HDFS. Typically the amount of
> mutations is very small compared to the number of records in each
> partitions.
> To handle this currently we re-read and re-write all partitions that could
> potentially be affected by new data. In practice a single hour's worth of
> new data can require the reading and writing of 1 month's worth of
> partitions. By storing the data in a transactional Hive table I believe
> that we can instead issue updates and deletes for only the affected rows.
> Although we do use Hive for analytics on this data, much of the processing
> that generates and consumes the data is performed using Cascading.
> Therefore I'd like to be able to read and write the data via an API which
> we'd aim to integrate into a Cascading Tap of some description. Our
> Cascading processes could determine the new, updated, and deleted records
> and then use the API to stream these changes to the transactional Hive
> table.
> We have most of this working in a proof of concept, but as
> hive-hcatalog-streaming does not expose the delete/update methods of the
> OrcRecordUpdater we've had to hack together something unpleasant based on
> the original API.
> As a first step I'd like to check if there is any appetite for adding such
> functionality to the API or if this goes against the original motivations
> of the project? If this suggestion sounds reasonable then I'd be keen to
> help move this forward.
> Thanks - Elliot.

View raw message