cassandra-commits mailing list archives

From "Simon Smith (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-440) get_key_range problems when a node is down
Date Mon, 14 Sep 2009 22:37:57 GMT


Simon Smith commented on CASSANDRA-440:


I tried out the patch attached above, applied it to 0.4, and it works for me.  Now, as soon
as I take a node down there may be one or two seconds of the Thrift internal error, i.e. the
timeout (which I fully expect, and which is obviously OK), but as soon as the querying host
sees that the node is down, the error stops and get_key_range returns valid output again.
And there is no disruption when the node comes back up.


Simon Smith
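
The brief window described above (timeouts until the querying host's failure detector notices the dead node) is something a client can ride out with a bounded retry. A minimal sketch, assuming a generic Callable-based query rather than the actual Thrift-generated client API (the wrapper and its names are hypothetical, not part of Cassandra):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical client-side retry wrapper: retries a query during the
// short window between a node dying and the coordinator noticing it.
public class RetryingQuery {
    // Retry `query` until it succeeds or `deadlineMillis` elapses,
    // sleeping briefly between attempts.
    public static <T> T withRetry(Callable<T> query, long deadlineMillis)
            throws Exception {
        long start = System.currentTimeMillis();
        while (true) {
            try {
                return query.call();
            } catch (TimeoutException e) {
                if (System.currentTimeMillis() - start > deadlineMillis)
                    throw e; // node still unreachable after the grace period
                Thread.sleep(200); // give the failure detector time to react
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated query: times out twice (as during the detection
        // window), then succeeds once the dead node is routed around.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new TimeoutException("Operation timed out.");
            return "ok";
        }, 5000);
        System.out.println(result);
    }
}
```

With the patch applied, the retry only has to cover the one-to-two-second detection window; without it, the error persisted until the dead node was restarted.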

> get_key_range problems when a node is down
> ------------------------------------------
>                 Key: CASSANDRA-440
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.4, 0.5
>         Environment: 64-bit 4GB Rackspace-cloud boxes running FC11 (saw problem on 32-bit platform as well)
>            Reporter: Simon Smith
>            Assignee: Jonathan Ellis
>         Attachments: 440.patch
> I'm running Cassandra on 5 nodes using the
> OrderPreservingPartitioner, and have populated Cassandra with 78
> records, and I can use get_key_range via Thrift just fine.  Then, if I
> manually kill one of the nodes (if I kill off node #5), the node (node
> #1) which I've been using to call get_key_range will time out with the
> error:
>  Thrift: Internal error processing get_key_range
> The Cassandra output traceback:
> ERROR - Encountered IOException on connection:
> java.nio.channels.SocketChannel[closed]
> Connection refused
>        ...
> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
> ERROR - Internal error processing get_key_range
> java.lang.RuntimeException: java.util.concurrent.TimeoutException:
> Operation timed out.
>        at org.apache.cassandra.service.StorageProxy.getKeyRange(
>        at org.apache.cassandra.service.CassandraServer.get_key_range(
>        at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(
>        at org.apache.cassandra.service.Cassandra$Processor.process(
>        at org.apache.thrift.server.TThreadPoolServer$
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(
>        at java.util.concurrent.ThreadPoolExecutor$
>        ...
> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>        ...
>        at org.apache.cassandra.service.StorageProxy.getKeyRange(
>        ... 7 more
> The error starts as soon as node #5 goes down and lasts until I
> restart it.
> bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
> and when it is up again)
> Since I set the replication factor to 3, I'm confused as to why (after
> the first few seconds or so) there is an error just because one host
> is down temporarily.
> (Jonathan Ellis and I discussed this on the mailing list, let me know if more information is needed.)
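
The confusion in the report comes down to replica placement: with 5 nodes and a replication factor of 3, killing one node should still leave every key range with two live replicas, so reads ought to keep working. A toy sketch of that arithmetic (a simplified walk-the-ring placement for illustration, not Cassandra's actual replication strategy code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why RF=3 should tolerate one dead node: a key's replicas
// are its primary node plus the next RF-1 nodes on the ring.
public class ReplicaSketch {
    static List<Integer> replicas(int primary, int nodes, int rf) {
        List<Integer> r = new ArrayList<>();
        for (int i = 0; i < rf; i++)
            r.add((primary + i) % nodes); // walk the ring clockwise
        return r;
    }

    public static void main(String[] args) {
        int nodes = 5, rf = 3, dead = 4; // kill node #5 (index 4)
        for (int primary = 0; primary < nodes; primary++) {
            List<Integer> r = replicas(primary, nodes, rf);
            long live = r.stream().filter(n -> n != dead).count();
            System.out.println("primary " + primary + " -> replicas " + r
                    + ", live = " + live);
        }
        // Every range keeps at least RF-1 = 2 live replicas, which is
        // why the timeout (rather than a fallback to a live replica)
        // looked like a bug.
    }
}
```

This is exactly the expectation the attached 440.patch restores: once the coordinator sees the node as down, it routes range queries to the remaining live replicas instead of timing out.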

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
