Date: Tue, 30 Jun 2015 16:52:04 +0000 (UTC)
From: "Hive QA (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.

    [ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608641#comment-14608641 ]

Hive QA commented on HIVE-10165:
--------------------------------

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12742837/HIVE-10165.10.patch

{color:green}SUCCESS:{color} +1 9128 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4445/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4445/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4445/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12742837 - PreCommit-HIVE-TRUNK-Build

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 1.2.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: streaming_api
>         Attachments: HIVE-10165.0.patch, HIVE-10165.10.patch, HIVE-10165.4.patch, HIVE-10165.5.patch, HIVE-10165.6.patch, HIVE-10165.7.patch, HIVE-10165.9.patch, mutate-system-overview.png
>
>
> h3. Overview
> I'd like to extend the [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest] API so that it also supports the writing of record updates and deletes in addition to the inserts it already supports.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence, and then applying a function to determine the inserted, updated, and deleted rows. However, in our current scheme we must rewrite every partition that could potentially contain changes, even though in practice the number of mutated records is very small compared with the number of records in a partition. This approach causes several operational issues:
> * An excessive amount of write activity is required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are being updated.
> * Because of the scale of the updates (hundreds of partitions), the scope for contention is high.
> I believe we can address this problem by instead writing only the changed records to a Hive transactional table. This should drastically reduce the amount of data we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass it through to an updated form of the hive-hcatalog-streaming API, which will then have the data required to apply the update or delete transactionally.
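> To make the intended usage concrete, here is a minimal sketch of how a merge process might drive such an API. This is purely illustrative: the {{MutationClient}}, {{MutationTransaction}}, and {{ChangedRecord}} names are hypothetical and are not the classes in the attached patch; {{org.apache.hadoop.hive.ql.io.RecordIdentifier}} is Hive's existing ACID row identifier.
> {code:java}
> import org.apache.hadoop.hive.ql.io.RecordIdentifier;
>
> // Hypothetical mutation-capable analogue of the existing streaming API's
> // HiveEndPoint/TransactionBatch pattern. All Mutation* names are illustrative.
> MutationClient client = new MutationClient(metaStoreUri, "my_db", "target_table");
> MutationTransaction txn = client.newTransaction();
> txn.begin();
> for (ChangedRecord change : mergedChanges) { // output of the upstream merge job
>   RecordIdentifier rowId = change.getRecordIdentifier(); // retained from the read
>   switch (change.getType()) {
>     case INSERT: txn.insert(change.getRecord()); break;        // new rows have no ROW_ID yet
>     case UPDATE: txn.update(rowId, change.getRecord()); break; // ROW_ID locates the original row
>     case DELETE: txn.delete(rowId); break;                     // ROW_ID alone identifies the victim
>   }
> }
> txn.commit();
> client.close();
> {code}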
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes.
> * Opens up Hive's transactional functionality in an accessible manner to processes that operate outside of Hive.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)