kafka-dev mailing list archives

From "Jun Rao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-691) Fault tolerance broken with replication factor 1
Date Thu, 10 Jan 2013 18:00:16 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549842#comment-13549842 ]

Jun Rao commented on KAFKA-691:
-------------------------------

Thanks for the patch. Overall, the patch is pretty good and is well thought out. Some comments:

1. DefaultEventHandler:
1.1 In handle(), I don't think we need to add the if test in the following statement. The
reason is that a message could fail to be sent because the leader changed immediately after
the previous metadata refresh. Normally, leaders are elected very quickly, so it makes sense
to refresh the metadata again.
          if (topicMetadataToRefresh.nonEmpty)
              Utils.swallowError(brokerPartitionInfo.updateInfo(outstandingProduceRequests.map(_.topic).toSet))
1.2 In handle(), it seems better to call the following code before dispatchSerializedData()
(a combined sketch of the handle() flow with 1.1 and 1.2 applied follows comment 3 below).
        if (topicMetadataRefreshInterval >= 0 &&
            SystemTime.milliseconds - lastTopicMetadataRefresh > topicMetadataRefreshInterval) {
          Utils.swallowError(brokerPartitionInfo.updateInfo(topicMetadataToRefresh.toSet))
          topicMetadataToRefresh.clear
          lastTopicMetadataRefresh = SystemTime.milliseconds
        }
1.3 getPartition(): If none of the partitions is available, we should throw LeaderNotAvailableException
instead of UnknownTopicOrPartitionException (sketched below).

2. DefaultPartitioner: Since the key is not expected to be null, we should remove the code that
deals with a null key (sketched below).

3. The consumer side logic is fine. The consumer rebalance is only triggered when there are
changes in partitions, not when there are changes in the availability of a partition. The
rebalance logic doesn't depend on a partition being available. If a partition is not available,
ConsumerFetcherManager will keep refreshing metadata. If you have a replication factor of
1, you will need to set a larger refresh.leader.backoff.ms if a broker is expected to be
down for a long time (an illustrative config follows below).
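
To make the suggestions above concrete, a few rough sketches follow. Names, signatures and types
are simplified stand-ins for the 0.8 producer and consumer code, not the actual classes. First,
the handle() flow with 1.1 and 1.2 applied: the interval-based metadata refresh runs before
dispatching, and the refresh of failed topics before a retry is unconditional.

    import scala.collection.mutable

    object HandleFlowSketch {
      case class KeyedMessage(topic: String, key: String, message: String)

      // Stub standing in for the real metadata cache.
      object brokerPartitionInfo { def updateInfo(topics: Set[String]): Unit = () }
      // Stub: returns the messages that failed to be sent.
      def dispatchSerializedData(msgs: Seq[KeyedMessage]): Seq[KeyedMessage] = Seq.empty

      val topicMetadataRefreshInterval = 600000L            // ms, illustrative value
      val topicMetadataToRefresh = mutable.Set.empty[String]
      var lastTopicMetadataRefresh = System.currentTimeMillis()

      def handle(events: Seq[KeyedMessage], retries: Int = 3): Unit = {
        var outstanding = events
        var remaining = retries + 1
        while (outstanding.nonEmpty && remaining > 0) {
          topicMetadataToRefresh ++= outstanding.map(_.topic)

          // 1.2: the interval-based refresh runs *before* dispatchSerializedData().
          if (topicMetadataRefreshInterval >= 0 &&
              System.currentTimeMillis() - lastTopicMetadataRefresh > topicMetadataRefreshInterval) {
            brokerPartitionInfo.updateInfo(topicMetadataToRefresh.toSet)
            topicMetadataToRefresh.clear()
            lastTopicMetadataRefresh = System.currentTimeMillis()
          }

          outstanding = dispatchSerializedData(outstanding)

          if (outstanding.nonEmpty) {
            remaining -= 1
            // 1.1: refresh the failed topics unconditionally (no nonEmpty guard),
            // since a leader may have changed right after the previous refresh.
            brokerPartitionInfo.updateInfo(outstanding.map(_.topic).toSet)
          }
        }
        // The real handler gives up with an exception once retries are exhausted.
      }
    }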
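
For 1.3, the intent is that an unknown topic still surfaces UnknownTopicOrPartitionException,
while a topic whose partitions currently have no leader surfaces LeaderNotAvailableException.
The exception classes and the selection among available partitions below are simplifications.

    object GetPartitionSketch {
      // Local stand-ins for kafka.common.UnknownTopicOrPartitionException and
      // kafka.common.LeaderNotAvailableException.
      class UnknownTopicOrPartitionException(msg: String) extends RuntimeException(msg)
      class LeaderNotAvailableException(msg: String) extends RuntimeException(msg)

      case class PartitionMetadata(partitionId: Int, hasLeader: Boolean)

      def getPartition(topic: String, key: Any, partitions: Seq[PartitionMetadata]): Int = {
        if (partitions.isEmpty)
          throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")

        val available = partitions.filter(_.hasLeader)
        if (available.isEmpty)
          // 1.3: no partition currently has a leader -- a (usually transient)
          // availability problem, not an unknown topic.
          throw new LeaderNotAvailableException("No leader for any partition of " + topic)

        // Simplified: hash the key over the available partitions only.
        available((key.hashCode & 0x7fffffff) % available.size).partitionId
      }
    }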
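
For 2, once the null-key case is handled before the partitioner is called, the partitioner
itself can assume a non-null key:

    // Simplified stand-in for DefaultPartitioner once the null-key path is gone.
    class DefaultPartitionerSketch[T] {
      def partition(key: T, numPartitions: Int): Int =
        // key is expected to be non-null here; the mask keeps the result non-negative
        (key.hashCode & 0x7fffffff) % numPartitions
    }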
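
For 3, a consumer reading topics with replication factor 1, where a broker may stay down for
a long time, can raise refresh.leader.backoff.ms so that ConsumerFetcherManager backs off
longer between leader lookups. The values below are only illustrative.

    import java.util.Properties

    object ConsumerPropsSketch {
      def props(): Properties = {
        val p = new Properties()
        p.put("zookeeper.connect", "localhost:2181")  // illustrative ZooKeeper address
        p.put("group.id", "example-group")            // illustrative consumer group
        p.put("refresh.leader.backoff.ms", "60000")   // back off 60s before re-checking for a leader
        p
      }
    }

A higher backoff simply trades detection latency for less metadata churn while the broker is down.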
                
> Fault tolerance broken with replication factor 1
> ------------------------------------------------
>
>                 Key: KAFKA-691
>                 URL: https://issues.apache.org/jira/browse/KAFKA-691
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Jay Kreps
>         Attachments: KAFKA-691-v1.patch
>
>
> In 0.7 if a partition was down we would just send the message elsewhere. This meant that
the partitioning was really more of a "stickiness" than a hard guarantee. This made it impossible
to depend on it for partitioned, stateful processing.
> In 0.8 when running with replication this should not be a problem generally, as the partitions
are now highly available and fail over to other replicas. However, in the case of replication
factor = 1 this no longer really works for most cases, since a dead broker will give errors for
the partitions on that broker.
> I am not sure of the best fix. Intuitively I think this is something that should be handled
by the Partitioner interface. However, currently the partitioner has no knowledge of which
nodes are available. So you could use a random partitioner, but that would keep going back
to the down node.
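
Purely as an illustration of the idea above, and not necessarily what the attached patch does:
a "random partitioner that avoids the down node" could look roughly like the following, assuming
the caller can tell it which partitions currently have a leader.

    import scala.util.Random

    object AvailabilityAwareRandomPartitionerSketch {
      // Hypothetical: random partitioning for non-keyed sends that skips partitions
      // whose broker is currently down, instead of cycling back to the dead one.
      def choosePartition(partitionIds: Seq[Int], available: Set[Int]): Int = {
        require(partitionIds.nonEmpty, "topic has no partitions")
        val candidates = partitionIds.filter(available.contains)
        if (candidates.nonEmpty) candidates(Random.nextInt(candidates.size))
        else partitionIds(Random.nextInt(partitionIds.size)) // nothing is up; the send will fail and be retried
      }
    }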

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
