hive-issues mailing list archives

From "Elliot West (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.
Date Thu, 09 Jul 2015 15:37:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620692#comment-14620692 ]

Elliot West commented on HIVE-10165:
------------------------------------

Thanks [~ekoifman]. With regard to your observation, I agree that the use of locks is incorrect.
I followed the pattern in the existing Streaming API, but of course that is concerned with
inserts only. Using [this reference|http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a]
I note that I should be using a semi-shared lock. I’d be grateful for any additional advice
you can give on when each lock type/target should be employed. A potential concern of mine
is that the system may not know the set of partitions when the transaction is initiated. In
this case, would it suffice to use a lock with a broader scope (i.e. a table lock), or should
I acquire additional locks each time I encounter a new partition?
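
To check my understanding, below is a minimal sketch of how I picture the acquisition, assuming
the metastore's {{LockRequestBuilder}}/{{LockComponentBuilder}} with one semi-shared component
per known partition (the user name and error handling are placeholders):

{code:java}
import java.util.List;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.LockComponentBuilder;
import org.apache.hadoop.hive.metastore.LockRequestBuilder;
import org.apache.hadoop.hive.metastore.api.LockResponse;

LockResponse acquireLocks(IMetaStoreClient client, long txnId, String db,
    String table, List<String> partitionNames) throws Exception {
  LockRequestBuilder request = new LockRequestBuilder();
  request.setTransactionId(txnId).setUser("streaming-client"); // placeholder user
  for (String partition : partitionNames) {
    request.addLockComponent(new LockComponentBuilder()
        .setDbName(db)
        .setTableName(table)
        .setPartitionName(partition)
        .setSemiShared() // update/delete: excludes other writers, permits readers
        .build());
  }
  // If the partition set is unknown when the transaction starts, a single
  // table-level component (no partition name) would presumably be used instead.
  return client.lock(request.build());
}
{code}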

As a side note, it appears as though the current locking documentation does not cover update/delete
scenarios or semi-shared locks. I'll volunteer to update these pages once I have a clearer
understanding of how these lock types apply to these operations and partitions:

* https://cwiki.apache.org/confluence/display/Hive/Locking
* https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Finally, as this issue is now resolved, should I submit patches using additional JIRA issues
or reopen this one?

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 1.2.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: TODOC2.0, streaming_api
>             Fix For: 2.0.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.10.patch, HIVE-10165.4.patch, HIVE-10165.5.patch, HIVE-10165.6.patch, HIVE-10165.7.patch, HIVE-10165.9.patch, mutate-system-overview.png
>
>
> h3. Overview
> I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
> API so that it also supports the writing of record updates and deletes in addition to the
> already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into existing
> datasets. Traditionally we achieve this by reading in a ground-truth dataset and a modified
> dataset, grouping by a key, sorting by a sequence, and then applying a function to determine
> inserted, updated, and deleted rows (sketched below). However, in our current scheme we must
> rewrite all partitions that may potentially contain changes. In practice the number of mutated
> records is very small when compared with the records contained in a partition. This approach
> results in a number of operational issues:
> * Excessive amount of write activity required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are being updated.
> * Due to the scale of the updates (hundreds of partitions) the scope for contention is high.
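>
> As a minimal sketch of the per-key merge decision described above ({{Record}}, {{isTombstone()}},
> and {{Mutation}} are illustrative names only, not part of any existing API):
> {code:java}
> // Hypothetical sketch: classify one key after grouping both datasets by
> // key and sorting the delta records by sequence (latest change last).
> enum Mutation { INSERT, UPDATE, DELETE, UNCHANGED }
>
> interface Record {
>   long sequence();       // ordering of changes within a key
>   boolean isTombstone(); // delta rows flagged as deletions
> }
>
> Mutation classify(Record groundTruth, Record latestDelta) {
>   if (latestDelta == null) return Mutation.UNCHANGED; // key untouched by the delta
>   if (groundTruth == null) return Mutation.INSERT;    // key only in the delta
>   if (latestDelta.isTombstone()) return Mutation.DELETE;
>   return Mutation.UPDATE;                             // key in both: take latest version
> }
> {code}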

> I believe we can address this problem by instead writing only the changed records to
> a Hive transactional table. This should drastically reduce the amount of data that we need
> to write and also provide a means for managing concurrent access to the data. Our existing
> merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass
> this through to an updated form of the hive-hcatalog-streaming API, which will then have the
> required data to perform an update or delete in a transactional manner.
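>
> To make the pass-through concrete, the updated API might expose mutation operations keyed on
> the retained identifier; this is a hypothetical sketch only, and {{MutableWriter}} with its
> method names is illustrative rather than a proposed final shape:
> {code:java}
> // RecordIdentifier is Hive's existing org.apache.hadoop.hive.ql.io.RecordIdentifier.
> import org.apache.hadoop.hive.ql.io.RecordIdentifier;
>
> // Hypothetical writer surface: inserts need no identifier, while updates
> // and deletes address an existing row via its retained RecordIdentifier.
> interface MutableWriter {
>   void insert(Object record);
>   void update(RecordIdentifier rowId, Object record);
>   void delete(RecordIdentifier rowId);
> }
> {code}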
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes  
> * Opens up Hive transactional functionality in an accessible manner to processes that
> operate outside of Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
