From: Fredrik L Stigbäck
To: user@cassandra.apache.org
Date: Thu, 01 Dec 2011 10:03:19 +0100
Subject: Hinted handoff bug?
Hi,
We're running Cassandra 1.0.3.
I've done some testing with 2 nodes (node A, node B), replication factor 2.
I take node A down, write some data to node B, and then bring node A back up.
Sometimes hints aren't delivered when node A comes up.

I've done some debugging in org.apache.cassandra.db.HintedHandOffManager and sometimes node B ends up in a strange state in method
org.apache.cassandra.db.HintedHandOffManager.deliverHints(final InetAddress to), where org.apache.cassandra.db.HintedHandOffManager.queuedDeliveries already has node A in its Set, so deliverHints returns immediately and no hints are ever delivered to node A.
The only reason for this that I can see is that in org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(InetAddress endpoint) the hintStore.isEmpty() check returns true and the method returns before the try block, so the endpoint (node A) isn't removed from org.apache.cassandra.db.HintedHandOffManager.queuedDeliveries. After that, no hints are delivered to node A until node B is restarted.
Under what conditions will hintStore.isEmpty() return true?
Shouldn't the hintStore.isEmpty() check be inside the try {} finally{} clause, removing the endpoint from queuedDeliveries in the finally block?

public void deliverHints(final InetAddress to)
{
    logger_.debug("deliverHints to {}", to);
    if (!queuedDeliveries.add(to))
        return;
    .......
}

private void deliverHintsToEndpoint(InetAddress endpoint) throws IOException, DigestMismatchException, InvalidRequestException, TimeoutException
{
    ColumnFamilyStore hintStore = Table.open(Table.SYSTEM_TABLE).getColumnFamilyStore(HINTS_CF);
    if (hintStore.isEmpty())
        return; // nothing to do, don't confuse users by logging a no-op handoff
    try
    {
        ......
    }
    finally
    {
        queuedDeliveries.remove(endpoint);
    }
}
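
To make the sequence I suspect concrete, here is a minimal standalone sketch of the guard pattern. It is not the Cassandra code: the class HintGuardSketch, the hintStoreEmpty flag, the String endpoints and both deliver methods are hypothetical stand-ins. It also runs everything synchronously, whereas the real delivery is scheduled in the part I elided. It only shows that an early return placed before the try leaves the endpoint in the set, while the same check placed inside the try does not.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the guard logic; not the real HintedHandOffManager.
class HintGuardSketch {
    // Plays the role of queuedDeliveries: endpoints with a delivery pending.
    final Set<String> queuedDeliveries = ConcurrentHashMap.newKeySet();
    // Stand-in for the result of hintStore.isEmpty().
    volatile boolean hintStoreEmpty;
    // Counts deliveries that actually ran.
    int deliveries = 0;

    // Mirrors deliverHints: returns false when the endpoint is already queued.
    boolean deliverHints(String to, boolean checkInsideTry) {
        if (!queuedDeliveries.add(to))
            return false;
        if (checkInsideTry)
            deliverWithCheckInsideTry(to);
        else
            deliverAsQuoted(to);
        return true;
    }

    // Same shape as the quoted method: the early return happens before the
    // try, so the finally block never runs and the endpoint stays queued.
    void deliverAsQuoted(String endpoint) {
        if (hintStoreEmpty)
            return;
        try {
            deliveries++;
        } finally {
            queuedDeliveries.remove(endpoint);
        }
    }

    // Proposed shape: the check sits inside the try, so the finally block
    // removes the endpoint on every exit path.
    void deliverWithCheckInsideTry(String endpoint) {
        try {
            if (hintStoreEmpty)
                return;
            deliveries++;
        } finally {
            queuedDeliveries.remove(endpoint);
        }
    }

    public static void main(String[] args) {
        for (boolean checkInsideTry : new boolean[] { false, true }) {
            HintGuardSketch sketch = new HintGuardSketch();
            sketch.hintStoreEmpty = true;   // first attempt sees an empty store
            sketch.deliverHints("nodeA", checkInsideTry);
            sketch.hintStoreEmpty = false;  // hints exist on the second attempt
            boolean accepted = sketch.deliverHints("nodeA", checkInsideTry);
            System.out.println("checkInsideTry=" + checkInsideTry
                    + " secondAttemptAccepted=" + accepted
                    + " deliveries=" + sketch.deliveries);
        }
    }
}
```

In the first variant the second attempt is rejected by the guard and nothing is delivered, which matches what I see on node B. In the second variant the second attempt goes through.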

Regards
/Fredrik