Date: Fri, 9 Oct 2015 04:34:04 -0700 (MST)
From: bitborn
To: user@spark.apache.org
Message-ID:
<1444390444949-24995.post@n3.nabble.com>
Subject: Kafka streaming "at least once" semantics

Hi all,

My company is using Spark Streaming and the Kafka APIs to process an event stream. We've got most of our application written, but are stuck on "at least once" processing.

I created a demo to show roughly what we're doing here:
https://github.com/bitborn/resilient-kafka-streaming-in-spark

The problem we're having is that when the application hits an exception (network issue, out of memory, etc.), it drops the batch it's processing. Ideally, it would process each event "at least once", even if that means processing some events more than once. Whether this happens via checkpointing, the WAL, or Kafka offsets is irrelevant, as long as we don't drop data. :)

A couple of things we've tried:
- Using the Kafka direct stream API (via Cody Koeninger)
- Using checkpointing with both the low-level and high-level APIs
- Enabling the write ahead log

I've included a log here (spark.log), but I'm afraid it doesn't reveal much.

The fact that others seem to be able to get this working properly suggests we're either missing some magic configuration or executing the job in a way that won't support the desired behavior. I'd really appreciate some pointers!

Thanks much,
Andrew Clarkson
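
To make the desired behavior concrete, here is a minimal, Spark-free sketch in Python of what "at least once" means in terms of offsets. It is not taken from the demo repository and does not use any Spark or Kafka API; every name in it (process_at_least_once, flaky_handler, the event names) is hypothetical. The only idea it illustrates is that the committed offset advances after a batch has been processed successfully, so a failed batch is replayed rather than dropped.

```python
# Hypothetical simulation of "at least once" processing; no Spark or Kafka involved.

def process_at_least_once(log, handler, batch_size=3, max_attempts=5):
    """Read `log` in batches and advance the committed offset only after
    `handler` has finished a batch without raising."""
    committed = 0  # offset of the next event that has not been processed yet
    while committed < len(log):
        batch = log[committed:committed + batch_size]
        for attempt in range(1, max_attempts + 1):
            try:
                handler(batch)
                break  # batch succeeded, safe to commit
            except Exception:
                if attempt == max_attempts:
                    raise  # give up without committing; a restart would replay
        committed += len(batch)  # commit only after success
    return committed


seen = []                      # every event the handler actually processed
failures = {"remaining": 1}    # fail exactly once to simulate a transient error


def flaky_handler(batch):
    for event in batch:
        if event == "e4" and failures["remaining"] > 0:
            failures["remaining"] -= 1
            raise RuntimeError("simulated network issue")
        seen.append(event)


events = ["e%d" % i for i in range(7)]
final_offset = process_at_least_once(events, flaky_handler)
```

In this run the second batch fails partway through and is replayed in full, so "e3" is processed twice and nothing is lost. That duplication is the price of "at least once", which is why the downstream handling of events needs to tolerate repeats.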