hive-issues mailing list archives

From "Elliot West (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.
Date Tue, 31 Mar 2015 14:09:52 GMT

     [ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliot West updated HIVE-10165:
-------------------------------
    Description: 
h3. Overview
I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
API so that it also supports the writing of record updates and deletes in addition to the
already supported inserts.

h3. Motivation
We have many Hadoop processes outside of Hive that merge changed facts into existing datasets.
Traditionally we achieve this by: reading in a ground-truth dataset and a modified dataset,
grouping by a key, sorting by a sequence and then applying a function to determine inserted,
updated, and deleted rows. However, in our current scheme we must rewrite all partitions that
may potentially contain changes. In practice the number of mutated records is very small when
compared with the records contained in a partition. This approach results in a number of operational
issues:
* Excessive amount of write activity required for small data changes.
* Downstream applications cannot robustly read these datasets while they are being updated.
* Due to the scale of the updates (hundreds of partitions) the scope for contention is high.

I believe we can address this problem by instead writing only the changed records to a Hive
transactional table. This should drastically reduce the amount of data that we need to write
and also provide a means for managing concurrent access to the data. Our existing merge processes
can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to
an updated form of the hive-hcatalog-streaming API which will then have the required data
to perform an update or insert in a transactional manner. 
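
As a rough illustration of the merge step described above, the sketch below emits only the changed records and carries each existing record's row identifier through to the resulting change. It is not taken from our jobs or from the patches: {{MergeSketch}}, {{Existing}}, {{Change}} and {{diff}} are hypothetical names, the row identifier is modelled as a plain string standing in for {{ROW_ID}}/{{RecordIdentifier}}, and it assumes the modified dataset holds the complete new state keyed by the same key (the sort-by-sequence step is omitted).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of change detection; none of these types exist in Hive. */
final class MergeSketch {

  enum Op { INSERT, UPDATE, DELETE }

  /** A record read from the existing table, with its retained row identifier. */
  static final class Existing {
    final String rowId; // assumption: stands in for ROW_ID/RecordIdentifier
    final String value;

    Existing(String rowId, String value) {
      this.rowId = rowId;
      this.value = value;
    }
  }

  /** One change to submit; rowId is null for inserts, which have no existing row. */
  static final class Change {
    final Op op;
    final String key;
    final String rowId;
    final String value;

    Change(Op op, String key, String rowId, String value) {
      this.op = op;
      this.key = key;
      this.rowId = rowId;
      this.value = value;
    }
  }

  /** Returns only the differences between the two datasets, both keyed by the same key. */
  static List<Change> diff(Map<String, Existing> truth, Map<String, String> modified) {
    List<Change> changes = new ArrayList<>();
    for (Map.Entry<String, String> e : modified.entrySet()) {
      Existing old = truth.get(e.getKey());
      if (old == null) {
        // Key not seen before: a new record.
        changes.add(new Change(Op.INSERT, e.getKey(), null, e.getValue()));
      } else if (!old.value.equals(e.getValue())) {
        // Key exists but the payload differs: update the existing row.
        changes.add(new Change(Op.UPDATE, e.getKey(), old.rowId, e.getValue()));
      }
      // Unchanged records produce no output at all.
    }
    for (Map.Entry<String, Existing> e : truth.entrySet()) {
      if (!modified.containsKey(e.getKey())) {
        // Key has disappeared from the new state: delete the existing row.
        changes.add(new Change(Op.DELETE, e.getKey(), e.getValue().rowId, null));
      }
    }
    return changes;
  }
}
```

Unchanged records never reach the output, which is the point of the proposal: the volume written scales with the number of mutations, not with partition size.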

h3. Benefits
* Enables the creation of large-scale dataset merge processes.
* Opens up Hive transactional functionality in an accessible manner to processes that operate
outside of Hive.

h3. Implementation
We've patched the API to expose the underlying {{OrcRecordUpdater}} and to allow
third parties outside of the package to extend {{AbstractRecordWriter}}. We've also
updated the user-facing interfaces to provide update and delete functionality. I've provided
the modifications as three incremental patches. Generally speaking, each patch makes the API
less backwards compatible but more consistent in how it offers updates and deletes alongside
writes (inserts). Ideally I hope that all three patches have merit, but only the first
patch is absolutely necessary to enable the features we need on the API, and it does so in
a backwards compatible way. I'll summarise the contents of each patch:

h4. [^HIVE-10165.0.patch] - Required
This patch contains what we consider to be the minimum set of changes required to allow
users to create {{RecordWriter}} subclasses that can insert, update, and delete records.
These changes also maintain backwards compatibility at the expense of confusing the API a
little. Note that the row representation has been changed from {{byte[]}} to {{Object}}. Within
our data processing jobs our records are often available in a strongly typed and decoded form
such as a POJO or a Tuple object. It therefore seems sensible to be able to pass
these through to the {{OrcRecordUpdater}} without having to go through a {{byte[]}} encoding
step. Users can of course still pass {{byte[]}} if they wish.
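
To make the {{Object}} row representation concrete, here is a minimal sketch of a writer that accepts typed records directly. It is an illustration only and does not reproduce the interfaces in the patch: {{MutatingRecordWriter}}, {{Customer}} and {{LoggingWriter}} are hypothetical names, the method shapes are assumptions, and the writer records operations in memory where a real subclass would hand the row to the {{OrcRecordUpdater}}.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer contract; NOT the interface from HIVE-10165.0.patch. */
interface MutatingRecordWriter {
  void insert(long transactionId, Object record);
  void update(long transactionId, Object record);
  void delete(long transactionId, Object record);
}

/** Example strongly typed record; rowId stands in for ROW_ID/RecordIdentifier. */
final class Customer {
  final String rowId; // null for records that do not exist in the table yet
  final String name;

  Customer(String rowId, String name) {
    this.rowId = rowId;
    this.name = name;
  }
}

/** In-memory stand-in that logs each operation instead of writing ORC deltas. */
final class LoggingWriter implements MutatingRecordWriter {
  final List<String> log = new ArrayList<>();

  @Override
  public void insert(long transactionId, Object record) {
    Customer c = typed(record);
    log.add("INSERT txn=" + transactionId + " name=" + c.name);
  }

  @Override
  public void update(long transactionId, Object record) {
    Customer c = typed(record);
    log.add("UPDATE txn=" + transactionId + " rowId=" + c.rowId + " name=" + c.name);
  }

  @Override
  public void delete(long transactionId, Object record) {
    Customer c = typed(record);
    log.add("DELETE txn=" + transactionId + " rowId=" + c.rowId);
  }

  /** The writer, not the API, decides which record types it understands. */
  private static Customer typed(Object record) {
    if (!(record instanceof Customer)) {
      throw new IllegalArgumentException("Unsupported record type: " + record);
    }
    return (Customer) record;
  }
}
```

Because the parameter type is {{Object}}, a different writer could accept {{byte[]}} in exactly the same way, so existing encoded-row usage would remain possible.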

h4. [^HIVE-10165.1.patch] - Nice to have
This patch builds on the changes made in the *required* patch and aims to make the API cleaner
and more consistent while accommodating updates and inserts. It also adds some logic to prevent
the user from submitting multiple operation types to a single {{TransactionBatch}} as we found
this creates data inconsistencies within the Hive table. This patch breaks backwards compatibility.
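
The restriction on mixing operation types could be enforced along the following lines. This is a sketch under stated assumptions, not the logic from the patch: {{SingleOperationGuard}} and {{OperationType}} are hypothetical names, and it assumes the guard is reset whenever a new {{TransactionBatch}} is started.

```java
/** Hypothetical guard; not the logic from HIVE-10165.1.patch. */
final class SingleOperationGuard {

  enum OperationType { INSERT, UPDATE, DELETE }

  // Null until the first operation of the current batch has been seen.
  private OperationType current;

  /** Call before every write; rejects an operation type that differs from the first one seen. */
  void check(OperationType requested) {
    if (current == null) {
      current = requested;
    } else if (current != requested) {
      throw new IllegalStateException(
          "Batch already contains " + current + " operations; cannot add " + requested);
    }
  }

  /** Assumption: called when a new batch begins, so the next batch may use any type. */
  void reset() {
    current = null;
  }
}
```

Failing fast on the first mismatched operation keeps the inconsistent data from ever being written, at the cost of requiring callers to group their changes by operation type.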

h4. [^HIVE-10165.2.patch] - Nomenclature
This final patch simply renames some of the existing types to more accurately convey their increased
responsibilities. The API no longer writes just new records; it is now also responsible
for writing operations that are applied to existing records. This patch breaks backwards compatibility.

h3. Example
I've attached a simple example of typical API usage. This is not a patch and is intended as an illustration
only: [^ReflectiveOperationWriter.java]

h3. Known issues
I have not yet provided any unit tests for the extended functionality. I fully expect that
these will be required and will work on them if the patches are considered to have merit.



> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, HIVE-10165.2.patch, ReflectiveOperationWriter.java
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
