Date: Thu, 24 Mar 2016 03:15:25 +0000 (UTC)
From: "Sushanth Sowmyan (JIRA)"
To: dev@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-13348) Add Event Nullification support for Replication

Sushanth Sowmyan created HIVE-13348:
---------------------------------------

             Summary: Add Event Nullification support for Replication
                 Key: HIVE-13348
                 URL: https://issues.apache.org/jira/browse/HIVE-13348
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sushanth Sowmyan


Replication, as implemented by HIVE-7973, works as follows:

a) For every single modification to the Hive metastore, an event is triggered that logs a notification object.
b) Replication tools such as Falcon can consume these notification objects as a HCatReplicationTaskIterator from HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).

c) For each event, we generate statements and distcp requirements so that Falcon can export, distcp and import to do the replication (along with the requisite changes to export and import that allow state management).

The big thing missing from this picture is that, while it works, it is pretty dumb about how it works: it exhaustively processes every single event generated and tries to do the export-distcp-import cycle for every modification, irrespective of whether the result will actually get used at import time.

We need to build some sort of filtering logic that can process a batch of events, identify the events that will result in effective no-ops, and nullify those events from the stream before passing it on. The goal is to minimize the number of events that tools like Falcon actually have to process.

Examples of cases where event nullification would take place:

a) CREATE-DROP: If an object is created in event#34 and eventually dropped in event#47, then there is no point in replicating it. We simply null out both of these events, and also any other event between event#34 and event#47 that references this object.

b) APPEND-APPEND: Some objects are replicated wholesale, which means every APPEND causes a full export of the object in question. The export for the last APPEND supplants all the prior APPENDs, so we can nullify all the prior such events.

Additional such cases can be inferred by analysis of the Export-Import replay protocol definition at https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf or by reasoning out the various event processing orders possible.

Replication, as implemented by HIVE-7973, is merely a first step for functional support.
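
To make the two example cases concrete, here is a minimal sketch of what a nullification pass over one batch of events might look like. It is not Hive or HCatalog code: the Event class, the nullify function and the wholesale_objects parameter are hypothetical names invented for illustration, and the sketch assumes each event refers to exactly one object and that CREATE, DROP and APPEND are the only event kinds that matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a metastore notification event."""
    event_id: int
    kind: str   # "CREATE", "DROP", "APPEND", or any other event type
    obj: str    # name of the table or partition the event refers to


def nullify(events, wholesale_objects=frozenset()):
    """Return the events that survive nullification, in original order."""
    dead = set()       # event_ids to remove from the stream
    created_at = {}    # obj -> index of a CREATE not yet matched by a DROP

    # CREATE-DROP: an object created and dropped inside the batch never
    # needs to reach the destination, so every event on it in between dies.
    for i, ev in enumerate(events):
        if ev.kind == "CREATE":
            created_at[ev.obj] = i
        elif ev.kind == "DROP" and ev.obj in created_at:
            start = created_at.pop(ev.obj)
            for other in events[start:i + 1]:
                if other.obj == ev.obj:
                    dead.add(other.event_id)

    # APPEND-APPEND: for objects assumed to be replicated wholesale, walk
    # backwards and keep only the last surviving APPEND per object.
    seen_append = set()
    for ev in reversed(events):
        if ev.event_id in dead or ev.kind != "APPEND":
            continue
        if ev.obj not in wholesale_objects:
            continue
        if ev.obj in seen_append:
            dead.add(ev.event_id)
        else:
            seen_append.add(ev.obj)

    return [ev for ev in events if ev.event_id not in dead]


# Example batch: tmp_tbl is created in event 34 and dropped in event 47,
# and the wholesale-replicated object "sales" is appended to twice.
batch = [
    Event(34, "CREATE", "tmp_tbl"),
    Event(35, "APPEND", "sales"),
    Event(36, "APPEND", "tmp_tbl"),
    Event(40, "APPEND", "sales"),
    Event(47, "DROP", "tmp_tbl"),
]
survivors = nullify(batch, wholesale_objects={"sales"})
print([ev.event_id for ev in survivors])  # prints [40]
```

In this sketch a DROP with no matching CREATE in the same batch is left alone, since the object presumably already exists at the destination and the DROP still has to be replicated.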
This work is needed for replication to be efficient at all, and thus, usable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)