kafka-jira mailing list archives

From "Charles Crain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas
Date Wed, 22 Nov 2017 19:46:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263231#comment-16263231 ]

Charles Crain commented on KAFKA-6249:
--------------------------------------

KAFKA-6144 is definitely related.  Solving it may solve my issue as long as stale data includes
data from standby replicas.

Question: is it possible to query a standby replica of a state store?  Let me elaborate: I
ran an experiment with 3 instances of a Kafka Streams app and printed the result of querying
a particular key from a state store on all 3.  As expected, with zero standby replicas, 1
instance returned the data while the other 2 returned null.

However, when I set the standby replicas config to 1, the result was the same: only 1 of the
3 returned data.  I would have naively expected 2 of the 3 instances to return valid data for
a particular key.  Perhaps this is intended behavior, i.e. the standby replica is "hidden"
somehow until it is made live.  But it would be very useful if the standby copy of the state
store data could be queried somehow.
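
For reference, the standby setting used in the experiment can be sketched as below.  This is
a minimal sketch, not the actual application code: the class name, method name, and the
example values are made up for illustration.  The three property keys are the standard
Kafka Streams config names.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys are real Kafka Streams settings.
final class StandbyConfigSketch {

    // Builds the properties that would be handed to the KafkaStreams constructor.
    static Properties streamsProps(String applicationId, String bootstrapServers, int standbyReplicas) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Number of standby copies kept for each state store (default is 0).
        props.setProperty("num.standby.replicas", Integer.toString(standbyReplicas));
        return props;
    }

    public static void main(String[] args) {
        // Example values are assumptions, not taken from the original experiment.
        Properties props = streamsProps("my-streams-app", "localhost:9092", 1);
        System.out.println(props.getProperty("num.standby.replicas"));
    }
}
```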

In fact, it would be ideal if metadataForKey() returned all nodes that have data for a
particular key available, including standbys.  That way, if one replica fails we could try
another.  That, combined with KAFKA-6144, should allow implementation of queryable stores with
zero downtime on node failure, as long as the number of standby replicas >= the total number
of nodes that fail before the rebalance is complete.
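
To make the idea concrete, here is a rough sketch of the client-side failover this would
enable.  It does not use any existing Kafka Streams API: the candidate host list stands in
for what an extended metadataForKey() might return (active host first, then standbys), and
the fetch function stands in for whatever remote call the application uses to query another
instance.  Both are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical sketch; none of these names come from Kafka Streams.
final class FailoverQuery {

    // Tries each candidate host in order and returns the answer of the first host that
    // responds.  A host that throws (node down, store not yet queryable) is skipped.
    // The Optional is empty if the responding host has no value for the key, or if
    // every host failed.
    static <V> Optional<V> queryWithFailover(
            List<String> candidateHosts,
            String key,
            BiFunction<String, String, V> fetch) {
        for (String host : candidateHosts) {
            try {
                return Optional.ofNullable(fetch.apply(host, key));
            } catch (RuntimeException e) {
                // This host cannot answer right now; fall through to the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed ordering: active host first, then the standby.
        List<String> hosts = Arrays.asList("active:8080", "standby:8080");
        Optional<String> result = queryWithFailover(hosts, "some-key", (host, key) -> {
            if (host.startsWith("active")) {
                throw new IllegalStateException("simulated node failure");
            }
            return "value served by " + host;
        });
        System.out.println(result.orElse("no host could answer"));
    }
}
```

With N standby replicas the candidate list has N + 1 entries, so the loop keeps answering as
long as no more than N of those hosts are down at once, which is the condition stated above.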

> Interactive query downtime when node goes down even with standby replicas
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-6249
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6249
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Charles Crain
>
> In a multi-node Kafka Streams application that uses interactive queries, the queryable
> store will become unavailable (throw InvalidStateStoreException) for up to several minutes
> when a node goes down.  This happens regardless of how many nodes are in the application as
> well as how many standby replicas are configured.
> My expectation is that if a standby replica is present, the interactive query would
> fail over to the live replica immediately, causing negligible downtime for interactive queries.
> Instead, what appears to happen is that the queryable store is down for however long it takes
> for the nodes to completely rebalance (this takes a few minutes for a couple GB of total data
> in the queryable store's backing topic).
> I am filing this as a bug, realizing that it may in fact be a feature request.  However,
> until there is a way we can use interactive queries with minimal (~zero) downtime on node
> failure, we are having to entertain other strategies for serving queries (e.g. manually materializing
> the topic to an external resilient store such as Cassandra) in order to meet our SLAs.
> If there is a way to minimize the downtime of interactive queries on node failure that
> I am missing, I would like to know what it is.
> Our team is super-enthusiastic about Kafka Streams and we're keen to use it for just
> about everything!  This is our only major roadblock.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
