incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: get_key_range (CASSANDRA-169)
Date Wed, 09 Sep 2009 22:52:26 GMT
Okay, so when #5 comes back up, #1 eventually stops erroring out and
you don't have to restart #1?  That is good, that would have been a
bigger problem. :)

If you are comfortable using a Java debugger (by default Cassandra
listens for one on port 8888), you can look at what is going on inside
StorageProxy.getKeyRange on node #1 at the call to

        EndPoint endPoint =
            StorageService.instance().findSuitableEndPoint(command.startWith);

findSuitableEndPoint is supposed to pick a live node, not a dead one. :)

If not, I can write a patch to log extra information for this bug so
we can track it down.
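
A rough sketch of the kind of logging I mean, right after that call
(just a sketch; logger_ and FailureDetector.instance().isAlive() are
what I'd expect to have available in StorageProxy, so adjust to
whatever trunk actually has):

        EndPoint endPoint =
            StorageService.instance().findSuitableEndPoint(command.startWith);
        // hypothetical diagnostic logging, not an actual patch:
        logger_.info("getKeyRange: startWith=" + command.startWith
                     + " -> endpoint " + endPoint + ", alive="
                     + FailureDetector.instance().isAlive(endPoint));

That would at least tell us whether a dead endpoint is being chosen,
or whether a live endpoint is chosen and the call to it is what times
out.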

-Jonathan

On Wed, Sep 9, 2009 at 5:43 PM, Simon Smith<simongsmith@gmail.com> wrote:
> The error starts as soon as node #5 goes down and lasts until I
> restart node #5.
>
> bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
> and when it is up again)
>
> Since I set the replication factor to 3, I'm confused as to why (after
> the first few seconds or so) there is an error just because one host
> is down temporarily.
>
> The way I have the test set up is that I have a script running on each
> of the nodes, calling get_key_range over and over against
> "localhost".  Depending on which node I take down, the behavior
> varies: if I take down one particular host, it is the only one giving
> errors (the other 4 nodes still work).  For the other 4 situations,
> either 2 or 3 nodes continue to work (i.e. the downed node and either
> one or two other nodes are the ones giving errors).  Note: the nodes
> that keep working never fail at all, not even for a few seconds.
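>
> The loop in that script is essentially this shape (a rough sketch
> using the generated Thrift Java bindings; the get_key_range argument
> list and the keyspace/column family names below are approximations,
> not copied from the actual script):
>
>     import java.util.List;
>     import org.apache.cassandra.service.Cassandra;
>     import org.apache.thrift.protocol.TBinaryProtocol;
>     import org.apache.thrift.transport.TSocket;
>     import org.apache.thrift.transport.TTransport;
>
>     public class RangeLoop {
>         public static void main(String[] args) throws Exception {
>             // 9160 is the default Thrift port
>             TTransport transport = new TSocket("localhost", 9160);
>             Cassandra.Client client =
>                 new Cassandra.Client(new TBinaryProtocol(transport));
>             transport.open();
>             while (true) {
>                 // keyspace/CF names and argument order are placeholders
>                 List<String> keys = client.get_key_range(
>                     "Keyspace1", "Standard1", "", "", 100);
>                 System.out.println("got " + keys.size() + " keys");
>                 Thread.sleep(1000);
>             }
>         }
>     }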
>
> I am running this on 4GB "cloud server" boxes at Rackspace.  I can set
> up just about any test needed to help debug this and capture output or
> logs, and can give a Cassandra developer access if it would help.  Of
> course I can include whatever config files or log files would be
> helpful; I just don't want to spam the list unless it is relevant.
>
> Thanks again,
>
> Simon
>
>
> On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>> getting temporary errors when a node goes down, until the other nodes'
>> failure detectors realize it's down, is normal.  (this should only
>> take a dozen seconds, or so.)
>>
>> but after that it should route requests to other nodes, and it should
>> also realize when you restart #5 that it is alive again.  those are
>> two separate issues.
>>
>> can you verify that "bin/nodeprobe cluster" shows that node 1
>> eventually does/does not see #5 dead, and alive again?
>>
>> -Jonathan
>>
>> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith<simongsmith@gmail.com> wrote:
>>> I'm seeing an issue similar to:
>>>
>>> http://issues.apache.org/jira/browse/CASSANDRA-169
>>>
>>> Here is when I see it.  I'm running Cassandra on 5 nodes using the
>>> OrderPreservingPartitioner, and have populated Cassandra with 78
>>> records, and I can use get_key_range via Thrift just fine.  Then, if I
>>> manually kill one of the nodes (in this case, node #5), the node (node
>>> #1) which I've been using to call get_key_range will time out with
>>> the error:
>>>
>>>  Thrift: Internal error processing get_key_range
>>>
>>> And the Cassandra output shows the same trace as in CASSANDRA-169:
>>>
>>> ERROR - Encountered IOException on connection:
>>> java.nio.channels.SocketChannel[closed]
>>> java.net.ConnectException: Connection refused
>>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>>>        at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>>>        at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>>>        at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
>>> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
>>> ERROR - Internal error processing get_key_range
>>> java.lang.RuntimeException: java.util.concurrent.TimeoutException:
>>> Operation timed out.
>>>        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>>>        at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>>        at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>>>        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>>>        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>        at java.lang.Thread.run(Thread.java:675)
>>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>>        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>>        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>>>        ... 7 more
>>>
>>>
>>>
>>> If it were giving an error just one time, I could just rely on
>>> catching the error and trying again.  But get_key_range calls to the
>>> node I was already querying (node #1) never work again (it is still
>>> up, and it responds fine to multiget Thrift calls), sometimes not
>>> even after I restart the downed node (node #5).  I end up having to
>>> restart node #1 in addition to node #5.  The behavior of the other 3
>>> nodes varies: some of them are also unable to respond to
>>> get_key_range calls, but some of them do respond.
>>>
>>> My question is: what path should I go down in terms of reproducing
>>> this problem?  I'm using Aug 27 trunk code; should I update my
>>> Cassandra install prior to gathering more information for this issue,
>>> and if so, to which version (0.4 or trunk)?  If anyone is familiar
>>> with this issue, could you let me know what I might be doing wrong,
>>> or what the next info-gathering step should be?
>>>
>>> Thank you,
>>>
>>> Simon Smith
>>> Arcode Corporation
>>>
>>
>
