hive-issues mailing list archives

From "Elliot West (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.
Date Tue, 31 Mar 2015 14:01:12 GMT

     [ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Elliot West updated HIVE-10165:
-------------------------------
    Attachment: ReflectiveOperationWriter.java

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, HIVE-10165.2.patch, ReflectiveOperationWriter.java
>
>
> h3. Overview
> I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
API so that it also supports the writing of record updates and deletes in addition to the
already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into existing
datasets. Traditionally we achieve this by: reading in a ground-truth dataset and a modified
dataset, grouping by a key, sorting by a sequence and then applying a function to determine
inserted, updated, and deleted rows. However, in our current scheme we must rewrite all partitions
that may potentially contain changes. In practice the number of mutated records is very small
when compared with the records contained in a partition. This approach results in a number
of operational issues:
> * Excessive amount of write activity required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are being updated.
> * Due to the scale of the updates (hundreds of partitions) the scope for contention is high.

> I believe we can address this problem by instead writing only the changed records to
a Hive transactional table. This should drastically reduce the amount of data that we need
to write and also provide a means for managing concurrent access to the data. Our existing
merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass
this through to an updated form of the hive-hcatalog-streaming API which will then have the
required data to perform an update or insert in a transactional manner. 
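
> The merge step described above (group by key, order by sequence, then classify each record)
can be sketched roughly as follows. This is a minimal illustration only: {{MergeSketch}}, {{Change}},
{{Mutation}} and {{classify}} are hypothetical names that are not part of Hive or of the attached
patches, records are reduced to string key/value pairs, and a {{null}} value is assumed to mean
"removed". A real process would also carry each record's {{ROW_ID}}/{{RecordIdentifier}} through
for updates and deletes, which is omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every type here is hypothetical and unrelated to the Hive API.
class MergeSketch {
    enum Op { INSERT, UPDATE, DELETE }

    // A changed fact; a null value is assumed to mark the record as removed upstream.
    static final class Change {
        final String key;
        final long sequence;
        final String value;

        Change(String key, long sequence, String value) {
            this.key = key;
            this.sequence = sequence;
            this.value = value;
        }
    }

    // One operation destined for the transactional table.
    static final class Mutation {
        final Op op;
        final String key;
        final String value;

        Mutation(Op op, String key, String value) {
            this.op = op;
            this.key = key;
            this.value = value;
        }
    }

    static List<Mutation> classify(Map<String, String> groundTruth, List<Change> changes) {
        // Group by key, keeping the change with the highest sequence for each key.
        Map<String, Change> latest = new LinkedHashMap<>();
        for (Change c : changes) {
            latest.merge(c.key, c, (a, b) -> b.sequence >= a.sequence ? b : a);
        }

        // Compare the surviving change per key with the ground truth.
        List<Mutation> mutations = new ArrayList<>();
        for (Change c : latest.values()) {
            boolean exists = groundTruth.containsKey(c.key);
            if (c.value == null) {
                if (exists) {
                    mutations.add(new Mutation(Op.DELETE, c.key, null));
                }
            } else if (!exists) {
                mutations.add(new Mutation(Op.INSERT, c.key, c.value));
            } else if (!c.value.equals(groundTruth.get(c.key))) {
                mutations.add(new Mutation(Op.UPDATE, c.key, c.value));
            }
            // Unchanged records produce no mutation, so nothing is rewritten for them.
        }
        return mutations;
    }
}
```

> Only the resulting mutations would need to be written, rather than every partition that might
contain a change.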
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes.
> * Opens up Hive transactional functionality in an accessible manner to processes that
operate outside of Hive.
> h3. Implementation
> We've patched the API to provide visibility to the underlying {{OrcRecordUpdater}} and
allow extension of the {{AbstractRecordWriter}} by third-parties outside of the package. We've
also updated the user facing interfaces to provide update and delete functionality. I've provided
the modifications as three incremental patches. Generally speaking, each patch makes the API
less backwards compatible but more consistent in offering updates and deletes as
well as writes (inserts). I hope that all three patches have merit, but only the first
patch is absolutely necessary to enable the features we need on the API, and it does so in
a backwards compatible way. I'll summarise the contents of each patch:
> h4. [^HIVE-10165.0.patch] - Required
> This patch contains what we consider to be the minimum set of changes required to
allow users to create {{RecordWriter}} subclasses that can insert, update, and delete records.
These changes also maintain backwards compatibility at the expense of confusing the API a
little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within
our data processing jobs our records are often available in a strongly typed and decoded form
such as a POJO or a Tuple object. It therefore seems sensible to be able to pass
these through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding
step. This of course still allows users to use {{byte[]}} if they wish.
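
> To illustrate why an {{Object}} row type is convenient, here is a small sketch of a writer that
accepts already-decoded records directly and still accepts {{byte[]}}. All names
({{ObjectRowSketch}}, {{RowWriter}}, {{Fact}}, {{FactWriter}}) are hypothetical and do not come from
the patches, and the comma-delimited {{byte[]}} encoding is purely an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these types are hypothetical stand-ins, not the patched Hive API.
class ObjectRowSketch {
    // A writer whose row type is Object rather than byte[].
    interface RowWriter {
        void write(Object row);
    }

    // A strongly typed, already-decoded record, as a processing job might hold it.
    static final class Fact {
        final String id;
        final long amount;

        Fact(String id, long amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    static final class FactWriter implements RowWriter {
        final List<Fact> written = new ArrayList<>();

        @Override
        public void write(Object row) {
            if (row instanceof Fact) {
                // Typed records pass straight through, with no encode/decode round trip.
                written.add((Fact) row);
            } else if (row instanceof byte[]) {
                // byte[] callers are still supported; "id,amount" in UTF-8 is an assumed encoding.
                String text = new String((byte[]) row, StandardCharsets.UTF_8);
                String[] parts = text.split(",", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Expected 'id,amount' but got: " + text);
                }
                written.add(new Fact(parts[0], Long.parseLong(parts[1].trim())));
            } else {
                throw new IllegalArgumentException("Unsupported row type: " + row);
            }
        }
    }
}
```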
> h4. [^HIVE-10165.1.patch] - Nice to have
> This patch builds on the changes made in the *required* patch and aims to make the API
cleaner and more consistent while accommodating updates and inserts. It also adds some logic
to prevent the user from submitting multiple operation types to a single {{TransactionBatch}}
as we found this creates data inconsistencies within the Hive table. This patch breaks backwards
compatibility.
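
> The guard against mixing operation types within a single {{TransactionBatch}} could look
something like the sketch below. It is not the logic from the patch: {{SingleOperationBatch}} and
{{OperationType}} are hypothetical names, and the class only demonstrates the idea of fixing the
operation type on first write and rejecting any other type afterwards.

```java
// Illustrative only: a hypothetical guard, not the code contained in HIVE-10165.1.patch.
class SingleOperationBatch {
    enum OperationType { INSERT, UPDATE, DELETE }

    private OperationType active; // null until the first operation is written
    private int written;

    void write(OperationType type, Object row) {
        if (type == null) {
            throw new IllegalArgumentException("Operation type must not be null");
        }
        if (active == null) {
            // The first write decides which operation type this batch accepts.
            active = type;
        } else if (active != type) {
            throw new IllegalStateException(
                "Batch already contains " + active + " operations; cannot add " + type);
        }
        // A real writer would hand the row to the underlying record updater here.
        written++;
    }

    int written() {
        return written;
    }
}
```

> Failing fast in this way means a caller that wants to mix inserts, updates and deletes has to
use separate batches for each operation type.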
> h4. [^HIVE-10165.2.patch] - Nomenclature
> This final patch simply renames some of the existing types to more accurately convey their
increased responsibilities. The API is no longer writing just new records, it is now also
responsible for writing operations that are applied to existing records. This patch breaks
backwards compatibility.
> h3. Example
> I've attached a simple example of typical usage of the API. This is not a patch and is intended as
an illustration only.
> h3. Known issues
> I have not yet provided any unit tests for the extended functionality. I fully expect
that these are required and will work on them if the patches are considered to have merit.
> *Note: Attachments to follow.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
