Return-Path: X-Original-To: apmail-samza-dev-archive@minotaur.apache.org Delivered-To: apmail-samza-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 487E818AF4 for ; Tue, 19 Jan 2016 11:11:53 +0000 (UTC) Received: (qmail 8461 invoked by uid 500); 19 Jan 2016 11:11:53 -0000 Delivered-To: apmail-samza-dev-archive@samza.apache.org Received: (qmail 8402 invoked by uid 500); 19 Jan 2016 11:11:53 -0000 Mailing-List: contact dev-help@samza.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@samza.apache.org Delivered-To: mailing list dev@samza.apache.org Received: (qmail 8390 invoked by uid 99); 19 Jan 2016 11:11:52 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 19 Jan 2016 11:11:52 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 4DC67C00ED for ; Tue, 19 Jan 2016 11:11:52 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.879 X-Spam-Level: ** X-Spam-Status: No, score=2.879 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 0x4y5RgBhAqv for ; Tue, 19 Jan 2016 11:11:51 +0000 (UTC) Received: from mail-lf0-f51.google.com (mail-lf0-f51.google.com [209.85.215.51]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id 344522565F for ; Tue, 19 Jan 2016 11:11:50 +0000 (UTC) Received: by mail-lf0-f51.google.com with SMTP id h129so141680862lfh.3 for ; Tue, 19 Jan 2016 03:11:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=swvJ5JofkWs2qPtanfzgL0a6JtsdpisDfdJpXAnDUWE=; b=MagT1VDkzUtLXQS7A4zzbF7yxGQ8nYNMEVimzwWU/YUtBsjchVsQff0q7+kQic89AR v1uG96nOlljisRBUXIw6E9boizNQf5CXsp+ZGdiBYGwE+YGQeLBXCkvjuUvF/3BOIKn6 yW9dYyjAp39vfJ4rWeXl4MLMgUoEEMiaLM0rDXF+gjNsbiuUYhr3Ar8syHQglUVTkEbO bgTYnqYC4/4yXyDDxFAhf9qa1ZeZezk87gYMVeh8t3E5NM+wV7zwED3M4aXsFkwSbh0O Y7JNmJnaGoDzwiShTaAEqN4MgtWgs0OSxG4ehBItfqtlh9Zc/kRAGW/kH1jEq2pnw3rA na5g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type; bh=swvJ5JofkWs2qPtanfzgL0a6JtsdpisDfdJpXAnDUWE=; b=iF4rRKz3RSnTMpc3ZStCd5iiUKW4nJfKvsGwx4K2z3ITzVB6No5h6OK+H6rgIgpRYT izGsXaGY/238gaJbbEGtOagnjF56FusjYIv8JOJys4ZZ3ip4q1VJBPAnXugu2fu9jobn N+OvrkZn1wjJ6IQFfIoztCB0ne3F1JOURdK44IbdUQNstm7tMZSjAPJfoUVid1dw+lIX bYVWwh/wyWTUjT7dc+ytE6TaTqyRA2vpSeGF2gF88YAEUGMPdKw7RZ8P4SR9nm9gWPMQ Ln9S3ja0+h1Gs7v08xSNtiB1WiBoqJkd3KFM777ngFOADJ6CdN5YAOz/tev6bkhjnQLs hA4w== X-Gm-Message-State: ALoCoQmcQ8TcWqfMzqJJI0dFME3Z0QQDz1ln64TOEOBVYnKBxB51Cn/TUgD9XPHzaSsmKv4jeNrXTbr7WFnjxfNvB59koYL57Q== X-Received: by 10.25.158.147 with SMTP id h141mr10479501lfe.147.1453201909403; Tue, 19 Jan 2016 03:11:49 -0800 (PST) MIME-Version: 1.0 Received: by 10.25.3.211 with HTTP; Tue, 19 Jan 2016 03:11:30 -0800 (PST) In-Reply-To: References: From: Michael Sklyar Date: Tue, 19 Jan 2016 13:11:30 +0200 Message-ID: Subject: Re: Data consistency and check-pointing To: dev@samza.apache.org Content-Type: multipart/alternative; boundary=001a1141020098d6830529adef6e --001a1141020098d6830529adef6e Content-Type: text/plain; charset=UTF-8 Hi Yi! I just figured it out yesterday and was about to send an update:) Yes, it covers our use case perfectly. Thanks, Michael On Tue, Jan 19, 2016 at 5:15 AM, Yi Pan wrote: > Hi, Michael, > > Your use case sounds much like a "customized checkpointing" to me. We have > similar cases in LinkedIn and the following are the solution in production: > 1) disable Samza auto-checkpoint by setting the commit_ms to -1 > 2) explicitly calling TaskCoordinator.commit() in sync with closing the > transaction batch > > The above procedure works well and gives user to ability to control the > commit of checkpoint together w/ your transaction batch. In case of system > crash between the closing of transaction batch and the checkpoint commit (I > am assuming this sequence of actions), we would follow the at-least-once > semantics and re-play the messages from the last commit. > > Please let us know whether that satisfies your use case. > > Thanks! > > -Yi > > On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar > wrote: > > > Hi, > > > > We have a Samza job reading messages from Kafka and inserting to hive via > > the Hive Streaming API. With Hive Streaming we are using > > "TransactionBatch", closing the Transaction batch closes the file on > HDFS. > > We close the transaction batch after reaching the a. Maximum messages per > > transaction batch or b. time threshold (for example - every 20K messages > or > > every 10 seconds). > > > > It works well, but in cases the job will terminate in the middle of a > > transaction batch we will have data inconsistency in hive, either: > > > > 1. Duplication: Data that was already inserted to hive will be processed > > again (since the checkpoint was taken earlier than the latest message > > written to hive). > > > > > > > > > > > > > > > > > > > > 2. Missing Data: Messages that were not committed to hive yet will not be > > reprocessed (since the checkpoint was written after > > > > What would be the recommended method of synchronizing hive/hdfs insertion > > with Samza checkpointing? I am thinking of overriding the > > *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and > > synchronize check-pointing with > > committing the data to hive. Is it a good idea? > > > > Thanks in advance, > > Michael Sklyar > > > --001a1141020098d6830529adef6e--