From jira-return-10098-archive-asf-public=cust-asf.ponee.io@kafka.apache.org  Tue Feb 20 04:15:15 2018
Return-Path: <jira-return-10098-archive-asf-public=cust-asf.ponee.io@kafka.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id A47C618067E
	for <archive-asf-public@cust-asf.ponee.io>; Tue, 20 Feb 2018 04:15:14 +0100 (CET)
Received: (qmail 94338 invoked by uid 500); 20 Feb 2018 03:15:13 -0000
Mailing-List: contact jira-help@kafka.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:jira-help@kafka.apache.org>
List-Unsubscribe: <mailto:jira-unsubscribe@kafka.apache.org>
List-Post: <mailto:jira@kafka.apache.org>
List-Id: <jira.kafka.apache.org>
Reply-To: jira@kafka.apache.org
Delivered-To: mailing list jira@kafka.apache.org
Received: (qmail 94306 invoked by uid 99); 20 Feb 2018 03:15:11 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Feb 2018 03:15:11 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 81CAEC0AFA
	for <jira@kafka.apache.org>; Tue, 20 Feb 2018 03:15:10 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: -110.311
X-Spam-Level:
X-Spam-Status: No, score=-110.311 tagged_above=-999 required=6.31
	tests=[ENV_AND_HDR_SPF_MATCH=-0.5, RCVD_IN_DNSWL_MED=-2.3,
	SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01, USER_IN_DEF_SPF_WL=-7.5,
	USER_IN_WHITELIST=-100] autolearn=disabled
Received: from mx1-lw-us.apache.org ([10.40.0.8])
	by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
	with ESMTP id hKrf0MSwwda8 for <jira@kafka.apache.org>;
	Tue, 20 Feb 2018 03:15:08 +0000 (UTC)
Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139])
	by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id EC1EC5F36F
	for <jira@kafka.apache.org>; Tue, 20 Feb 2018 03:15:07 +0000 (UTC)
Received: from jira-lw-us.apache.org (unknown [207.244.88.139])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id CB341E00EB
	for <jira@kafka.apache.org>; Tue, 20 Feb 2018 03:15:06 +0000 (UTC)
Received: from jira-lw-us.apache.org (localhost [127.0.0.1])
	by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 8B26021E5B
	for <jira@kafka.apache.org>; Tue, 20 Feb 2018 03:15:03 +0000 (UTC)
Date: Tue, 20 Feb 2018 03:15:00 +0000 (UTC)
From: "Matthias J. Sax (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.12956307.1459894133000.231656.1519096500226@Atlassian.JIRA>
In-Reply-To: <JIRA.12956307.1459894133000@Atlassian.JIRA>
References: <JIRA.12956307.1459894133000@Atlassian.JIRA> <JIRA.12956307.1459894133156@jira-lw-us.apache.org>
Subject: [jira] [Assigned] (KAFKA-3514) Stream timestamp computation needs
 some further thoughts
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


     [ https://issues.apache.org/jira/browse/KAFKA-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax reassigned KAFKA-3514:
--------------------------------------

    Assignee: Matthias J. Sax

> Stream timestamp computation needs some further thoughts
> --------------------------------------------------------
>
>                 Key: KAFKA-3514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3514
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Guozhang Wang
>            Assignee: Matthias J. Sax
>            Priority: Major
>              Labels: architecture
>
> A stream task's timestamp is currently used both for the punctuate function and for selecting which stream to process next (i.e., best-effort stream synchronization). It is defined as the smallest timestamp over all partitions in the task's partition group. This results in two unintuitive corner cases:
> 1) Observing a late-arriving record keeps that stream's timestamp low for a period of time, so that stream keeps being processed until the late record is reached. For example, take two partitions within the same task, annotated by their timestamps:
> {code}
> Stream A: 5, 6, 7, 8, 9, 1, 10
> {code}
> {code}
> Stream B: 2, 3, 4, 5
> {code}
> The late-arriving record with timestamp "1" causes stream A to be selected continuously in the thread loop, i.e., the messages with timestamps 5, 6, 7, 8, 9 are processed until the late record itself is dequeued and processed; only then is stream B selected, starting with timestamp 2.
> 2) A partition with an empty buffer does not advance its timestamp, and hence the task timestamp does not advance either, since it is the smallest among all partitions. This may not be as severe a problem as 1) above, though.
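
The first corner case can be reproduced with a small simulation. This is a minimal sketch, not Kafka Streams code: it assumes that all records listed above are already buffered, that a partition's timestamp is the smallest timestamp among its buffered records, and that the task always picks the partition with the lowest such timestamp. The name process_order is hypothetical.

```python
from collections import deque


def process_order(streams):
    """Return the (stream, timestamp) pairs in the order they would be processed.

    Assumption: a partition's timestamp is the minimum over its buffered
    records, and the partition with the lowest timestamp is processed next.
    """
    queues = {name: deque(timestamps) for name, timestamps in streams.items()}
    order = []
    while any(queues.values()):
        # Pick the non-empty partition whose smallest buffered timestamp is lowest.
        name = min(
            (n for n, q in queues.items() if q),
            key=lambda n: min(queues[n]),
        )
        # Records are still dequeued in arrival (offset) order, not timestamp order.
        order.append((name, queues[name].popleft()))
    return order


order = process_order({"A": [5, 6, 7, 8, 9, 1, 10], "B": [2, 3, 4, 5]})
print(order)
```

Under these assumptions, stream A is drained up to and including the late record (5, 6, 7, 8, 9, 1) before any record of stream B is processed, and A's record 10 comes last.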
> *Update*
> There is one more thing to consider (full discussion found here: http://search-hadoop.com/m/Kafka/uyzND1iKZJN1yz0E5?subj=Order+of+punctuate+and+process+in+a+stream+processor)
> {quote}
> Let's assume the following case.
> - a stream processor that uses the Processor API
> - context.schedule(1000) is called in the init()
> - the processor reads only one topic that has one partition
> - using a custom timestamp extractor, but that timestamp is just the
> wall-clock time
> Imagine the following events:
> 1., for 10 seconds I send in 5 messages / second
> 2., I do not send any messages for 3 seconds
> 3., I start sending 5 messages / second again
> I see that punctuate() is not called during the 3 seconds when I do not
> send any messages. This is OK according to the documentation, because
> there are no new messages to trigger the punctuate() call. When the
> first few messages arrive after sending restarts (point 3. above), I
> see the following sequence of method calls:
> 1., process() on the 1st message
> 2., punctuate() is called 3 times
> 3., process() on the 2nd message
> 4., process() on each following message
> What I would expect instead is that punctuate() is called first and then
> process() is called on the messages, because the first message's timestamp
> is already 3 seconds later than the last punctuate() call, so the
> first message belongs after the 3 punctuate() calls.
> {quote}
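
The ordering described in the quote can be modeled with a short simulation. This is a sketch under stated assumptions, not the Processor API: the function run and its parameters are hypothetical, punctuation is driven purely by record timestamps, and the first punctuation is assumed to fall one interval after the first record. The timestamps are shortened to three messages, a gap of about 3 seconds, and two more messages.

```python
def run(timestamps_ms, interval_ms=1000, punctuate_first=False):
    """Return the sequence of ("process" | "punctuate", time_ms) calls.

    punctuate_first=False models the reported behavior: the record is
    processed first, then all punctuations that became due are fired.
    punctuate_first=True models the ordering the reporter expected.
    """
    calls = []
    next_punctuate = None
    for ts in timestamps_ms:
        if next_punctuate is None:
            # Assumption: the schedule starts at the first record's timestamp.
            next_punctuate = ts + interval_ms
        due = []
        while ts >= next_punctuate:
            due.append(("punctuate", next_punctuate))
            next_punctuate += interval_ms
        event = [("process", ts)]
        calls.extend(due + event if punctuate_first else event + due)
    return calls


records = [0, 200, 400, 3500, 3700]  # gap of about 3 s between 400 and 3500
observed = run(records)
expected = run(records, punctuate_first=True)
print(observed)
print(expected)
```

In the observed variant the record at 3500 is processed before the three punctuations for 1000, 2000 and 3000; in the expected variant those three punctuations come first.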

