incubator-cassandra-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject Re: messages stopped for 3 minutes?
Date Mon, 26 Sep 2011 07:25:09 GMT
OK, I got more detailed logging that seems to give some clues about
this mystery ....

I enabled TRACE-level logging for Gossiper, and also added a debug
statement to MessageDeliveryTask so that every arriving message shows
up in the log.
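
For reference, the added statement is roughly the following (just a
sketch; the exact body of MessageDeliveryTask.run() differs between
Cassandra versions):

    // added near the top of MessageDeliveryTask.run(), before the
    // message is handed off to its verb handler (sketch only)
    logger.debug("delivering msg");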

I have 2 nodes, and all the requests go through the coordinator node.
Basically I see 2 possible issues.

1)
At the start of bad_coord.txt, we see that the coordinator was
receiving messages correctly until

DEBUG [RequestResponseStage:22] 2011-09-26 05:43:35,521
MessageDeliveryTask.java (line 60) delivering msg


Then the line right after it is
DEBUG [RequestResponseStage:24] 2011-09-26 05:43:59,516
MessageDeliveryTask.java (line 60) delivering msg

So there is a 24-second delay (05:43:59,516 minus 05:43:35,521); in
this period, no messages were delivered on the coordinator. This is
enough to trigger a timeout and an unavailable response, and indeed we
see a TimedOutException and UnavailableException later in the log.
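
(In case anyone wants to reproduce this scan: below is a minimal,
self-contained sketch that parses the log4j timestamps on "delivering
msg" lines and flags any long gap between consecutive deliveries. The
class name and the 10-second threshold are made up.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Scans a Cassandra log for silent periods between consecutive
    // "delivering msg" lines longer than a threshold.
    public class DeliveryGapFinder
    {
        private static final Pattern TS =
            Pattern.compile("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})");

        public static void main(String[] args) throws Exception
        {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
            long thresholdMs = 10000; // flag silent periods longer than 10 s
            Date prev = null;
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null)
            {
                if (!line.contains("delivering msg"))
                    continue;
                Matcher m = TS.matcher(line);
                if (!m.find())
                    continue;
                Date cur = fmt.parse(m.group(1));
                if (prev != null && cur.getTime() - prev.getTime() > thresholdMs)
                    System.out.printf("gap of %d ms ending at: %s%n",
                                      cur.getTime() - prev.getTime(), line);
                prev = cur;
            }
            in.close();
        }
    }

Run against bad_coord.txt, it should flag exactly this ~24-second gap
(23,995 ms between 05:43:35,521 and 05:43:59,516).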


During this period, on the replica box, bad_replica.txt shows correct
message delivery up to
DEBUG [ReadStage:107] 2011-09-26 05:43:35,214 MessageDeliveryTask.java
(line 60) delivering msg

Then on the next line, it starts doing something different:
TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 125)
My heartbeat is now 5721
TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 357)
Gossip Digests are : /10.71.13.5:1317009996:5765
ec2-50-19-172-99.compute-1.amazonaws.com/10.71.21.46:1317010027:5721


I have not read the code in detail to understand what this means (each
digest appears to be endpoint:generation:maxVersion), but it seems to
correspond in time with the "silent period" on the coordinator. It
keeps doing this until
 INFO [GossipTasks:1] 2011-09-26 05:43:53,638 Gossiper.java (line 721)
InetAddress /10.71.13.5 is now dead.

and
DEBUG [MutationStage:55] 2011-09-26 05:43:59,208
MessageDeliveryTask.java (line 60) delivering msg
 INFO [GossipStage:1] 2011-09-26 05:43:59,211 Gossiper.java (line 707)
InetAddress /10.71.13.5 is now UP


===== So, here the problem is that messages stopped showing up on the
coordinator from 05:43:35 to 05:43:59, even though the replica was
trying to send them. Not sure whether this is due to the network.


2) The other problem actually causes a longer delay on the application
side. At the end of bad_coord.txt we see that the TimedOutException and
UnavailableException were actually printed out much later, after
06:06:33:
TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 802)
/10.71.21.46local generation 1317010027, remote generation 1317010027
TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 847)
Updating heartbeat state version to 7119 from 7118 for /10.71.21.46
...
ERROR [New I/O server worker #1-15] 2011-09-26 05:43:59,500
MY_APPLICATION_CODE.java (line 234) :
Caused by: TimedOutException()

** The main issue here is that the timestamp of the logging line
itself says 05:43:59, which is correctly when the TimedOutException was
generated **, but the line was not actually printed out until 06:06:33,
23 minutes later. This causes a 23-minute pause visible to the clients.

This 23-minute delay is much more damaging than the 24-second
message-drop period in 1). I wonder what could have caused this
23-minute delay. Could it be lock contention on grabbing the log4j
appender?? But it shouldn't be, since I only added the extra logging
AFTER I had already seen the problem appear.
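
(A note on the log4j theory: in log4j 1.2, a thread stuck inside an
appender can block other logging threads, since appender calls are
synchronized. One way I could try to rule this out is to wrap the
existing appenders in an AsyncAppender. The sketch below is
hypothetical and untested; normally this is configured in log4j.xml
rather than in code.)

    import java.util.Enumeration;
    import org.apache.log4j.Appender;
    import org.apache.log4j.AsyncAppender;
    import org.apache.log4j.Logger;

    // Re-attach all root appenders behind an AsyncAppender so that a
    // slow or contended appender cannot block the calling threads.
    public class AsyncLogSetup
    {
        public static void makeRootAsync()
        {
            Logger root = Logger.getRootLogger();
            AsyncAppender async = new AsyncAppender();
            for (Enumeration<?> e = root.getAllAppenders(); e.hasMoreElements(); )
                async.addAppender((Appender) e.nextElement());
            root.removeAllAppenders();
            root.addAppender(async);
        }
    }

If the 23-minute stall disappeared with the async wrapper, appender
contention would be the likely cause.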


Thanks for your patience in reading this far :)


Yang





On Sun, Sep 25, 2011 at 10:32 AM, Yang <teddyyyy123@gmail.com> wrote:
> thanks Jonathan,
>
>
> I really don't know. I just ran further tests overnight to capture
> jstack dumps on the receiving side, and I'm going through these stacks
> now. If I can't find anything suspicious, I'll add this debugging to
> the sending side too.
>
> Another useful piece of info: when I did a single-node setup, I also
> found a lot of TimedOutExceptions, similar to what I see with the
> 2-node setup. I think I didn't see the UnavailableException, probably
> because it's just a single node, and the node always believes itself
> to be available.
>
> I think a GC issue is not the culprit here, since I don't see any
> lengthy GC activity in the logs around when the delay is happening. No
> compaction/flushing either.
>
>
>
> On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> What makes you think the problem is on the receiving node, rather than
>> the sending node?
>>
>> On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy123@gmail.com> wrote:
>>> I constantly see TimedOutException, followed by UnavailableException,
>>> in my logs, so I added some extra debugging to
>>> Gossiper.notifyFailureDetector():
>>>
>>>
>>>
>>>    void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
>>>    {
>>>        IFailureDetector fd = FailureDetector.instance;
>>>        EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>>        logger.debug("notify failure detector");
>>>        /*
>>>         * If the local endpoint state exists then report to the FD only
>>>         * if the versions workout.
>>>         */
>>>        if ( localEndpointState != null )
>>>        {
>>>            logger.debug("notify failure detector, endpoint");
>>>            int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
>>>            int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
>>>            if ( remoteGeneration > localGeneration )
>>>            {
>>>                localEndpointState.updateTimestamp();
>>>                logger.debug("notify failure detector --- report 1");
>>>                fd.report(endpoint);
>>>                return;
>>>            }
>>>
>>>
>>>
>>>
>>> Then I found that this method stopped being called for a period of 3
>>> minutes, so of course the detector considers the other side to be
>>> dead.
>>>
>>> But since these 2 boxes are in the same EC2 region and the same
>>> security group, there is no reason for a network issue to last that
>>> long. So I ran a background job that just does
>>>
>>> echo | nc $the_other_box 7000   in a loop
>>>
>>> and this always works fine, never failing to contact port 7000.
>>>
>>>
>>> So somehow the messages were not delivered or received; how could I debug this?
>>> (extra logging attached)
>>>
>>> Thanks
>>> Yang
>>>
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
