From issues-return-154355-archive-asf-public=cust-asf.ponee.io@hive.apache.org  Thu Mar 28 04:48:03 2019
Return-Path: <issues-return-154355-archive-asf-public=cust-asf.ponee.io@hive.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 85FE9180648
	for <archive-asf-public@cust-asf.ponee.io>; Thu, 28 Mar 2019 05:48:02 +0100 (CET)
Received: (qmail 2706 invoked by uid 500); 28 Mar 2019 04:48:01 -0000
Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:issues-help@hive.apache.org>
List-Unsubscribe: <mailto:issues-unsubscribe@hive.apache.org>
List-Post: <mailto:issues@hive.apache.org>
List-Id: <issues.hive.apache.org>
Reply-To: dev@hive.apache.org
Delivered-To: mailing list issues@hive.apache.org
Received: (qmail 2695 invoked by uid 99); 28 Mar 2019 04:48:01 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 28 Mar 2019 04:48:01 +0000
Received: from jira-lw-us.apache.org (unknown [207.244.88.139])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 2AD5DE0D27
	for <issues@hive.apache.org>; Thu, 28 Mar 2019 04:48:01 +0000 (UTC)
Received: from jira-lw-us.apache.org (localhost [127.0.0.1])
	by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 69706245A2
	for <issues@hive.apache.org>; Thu, 28 Mar 2019 04:48:00 +0000 (UTC)
Date: Thu, 28 Mar 2019 04:48:00 +0000 (UTC)
From: "Gopal V (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13224475.1553742084000.141797.1553748480429@Atlassian.JIRA>
In-Reply-To: <JIRA.13224475.1553742084000@Atlassian.JIRA>
References: <JIRA.13224475.1553742084000@Atlassian.JIRA> <JIRA.13224475.1553742084087@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (HIVE-21530) Replicate Streaming ingest on
 ACID tables.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HIVE-21530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803583#comment-16803583 ] 

Gopal V edited comment on HIVE-21530 at 3/28/19 4:47 AM:
---------------------------------------------------------

bq. remove side files ( which looks like are suffixed as _flush in file names) when the batch is committed.

Streamingv2/ACIDv2 does not generate flush length files - both the Spark and the NiFi implementations do not generate them.

That was a deliberate choice to simplify REPL for streaming ingest (well, I say that, but it also meant that streaming ingest would work on filesystems without hflush data consistency support for multiple files; I can think of only one FS which implements it without potential for data loss).


was (Author: gopalv):
bq. remove side files ( which looks like are suffixed as _flush in file names) when the batch is committed.

Streamingv2/ACIDv2 does not generate flush length files - both the Spark and the NiFi implementations do not generate them.

That was a deliberate choice to simplify REPL for streaming ingest.


> Replicate Streaming ingest on ACID tables.
> ------------------------------------------
>
>                 Key: HIVE-21530
>                 URL: https://issues.apache.org/jira/browse/HIVE-21530
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl, Transactions
>    Affects Versions: 4.0.0
>            Reporter: Sankar Hariappan
>            Assignee: mahesh kumar behera
>            Priority: Major
>              Labels: DR, Replication
>         Attachments: Hive ACID Replication_ Streaming Ingest Tables.pdf
>
>
> Implement replication of Hive streaming ingest of tables as per [^Hive ACID Replication_ Streaming Ingest Tables.pdf].
> Change txn_commit to include information about the transaction batch.
> Change the copy task to copy only if there is a difference in file size or checksum; this seems specific to transaction batches and shouldn't be used for normal transactions.
> Copy the correct sequence of files w.r.t. data file + side file.
> remove side files ( which looks like are suffixed as _flush in file names) when the batch is committed.
> How do we determine the idempotent nature of the events here: update the corresponding table + partition and not copy a new version of the file?
> Validate whether partially copied data files are handled on the target warehouse given the correct side file. Can we leave the side file forever? If the primary warehouse fails during a transaction batch copy, after certain transactions have been copied over, we won't be able to remove the _flush file; do we have to handle this on failover?
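
The "copy only if there is a difference in file size or checksum" item in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hive's copy task: the function names (file_checksum, needs_copy, copy_if_changed) are hypothetical, the files are assumed to sit on a local filesystem rather than HDFS, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hex digest of the file, read in chunks.

    SHA-256 is an assumption made for this sketch only.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(source: Path, target: Path) -> bool:
    """Decide whether the source must be copied over the target."""
    if not target.exists():
        return True
    # Cheap check first: a size difference already proves the files differ.
    if source.stat().st_size != target.stat().st_size:
        return True
    # Same size: fall back to comparing content checksums.
    return file_checksum(source) != file_checksum(target)


def copy_if_changed(source: Path, target: Path) -> bool:
    """Copy source to target only when size or checksum differs.

    Returns True if a copy was performed, False if it was skipped.
    """
    if not needs_copy(source, target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return True
```

Replaying the same copy with an unchanged source is then a no-op, which is one way to approach the idempotency question raised in the description; whether that is sufficient for transaction batches is left open there.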


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)