incubator-cassandra-user mailing list archives

From Jason Horman <jhor...@gmail.com>
Subject Re: Cassandra lockup (0.6.5) on bulk write
Date Tue, 05 Oct 2010 23:08:54 GMT
Yes, you are right. For some reason we didn't notice it. The Cassandra
process itself was still up, so the on-machine monitoring for the process
didn't go off; nodetool shows it as unresponsive, though. We are still
learning how to monitor properly. The machine that went down ran out of
disk space on Amazon EBS.

So I believe our client was connected to that machine when it ran into
trouble. I am a little surprised that it wasn't disconnected; it just hung
forever. We are using Pelops, which doesn't seem to set a Thrift timeout.
It also sets keep-alive on the socket. Here is the Pelops connection code:

socket = new TSocket(nodeContext.node, port);
socket.getSocket().setKeepAlive(true);

Server side, the default RPC timeout is used:
<RpcTimeoutInMillis>10000</RpcTimeoutInMillis>

Is RpcTimeoutInMillis supposed to have booted our client after 10s, or is
the server now just in a really bad state? Should I modify Pelops to set a
timeout on the TSocket? Is setKeepAlive recommended?
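
If I do end up patching Pelops, this is roughly the change I have in mind.
It is only a sketch, not what Pelops does today; the host/port handling and
the 10000 ms value (picked to mirror RpcTimeoutInMillis) are placeholders.

import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

public class TimeoutSketch {
    // Open a framed transport with a socket timeout so a blocked read
    // eventually throws instead of hanging the client forever.
    public static TTransport openTransport(String host, int port)
            throws TTransportException, java.net.SocketException {
        TSocket socket = new TSocket(host, port, 10000); // 3rd arg = socket timeout in ms
        socket.getSocket().setKeepAlive(true); // keep-alive detects dead peers eventually,
                                               // but only the timeout bounds a read
        TTransport transport = new TFramedTransport(socket);
        transport.open();
        return transport;
    }
}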

We are writing at consistency level ONE; the replication factor is 4.
There are 5 Cassandra servers at the moment, but in production we will run
with more. This is on Amazon EC2/EBS, so I/O performance isn't great. I
think the cluster appears unbalanced because of the high replication
factor.
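
For reference, the write path boils down to a batch_mutate at ONE over
framed Thrift. Below is a stripped-down sketch of the equivalent raw 0.6
Thrift call, not our actual code; the keyspace, column family, and key
names are made up.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.Mutation;

public class WriteSketch {
    // Insert one column for one row via batch_mutate at consistency level ONE.
    public static void writeOne(Cassandra.Client client) throws Exception {
        Column col = new Column("name".getBytes("UTF-8"),
                                "value".getBytes("UTF-8"),
                                System.currentTimeMillis() * 1000); // microsecond timestamps

        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(col);
        Mutation mutation = new Mutation();
        mutation.setColumn_or_supercolumn(cosc);

        // row key -> column family -> list of mutations
        List<Mutation> mutations = new ArrayList<Mutation>();
        mutations.add(mutation);
        Map<String, List<Mutation>> byColumnFamily = new HashMap<String, List<Mutation>>();
        byColumnFamily.put("Standard1", mutations);
        Map<String, Map<String, List<Mutation>>> byRowKey =
                new HashMap<String, Map<String, List<Mutation>>>();
        byRowKey.put("rowkey1", byColumnFamily);

        // ONE: the coordinator returns after a single replica acknowledges (our RF is 4)
        client.batch_mutate("Keyspace1", byRowKey, ConsistencyLevel.ONE);
    }
}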

If you are interested, here is the stack trace from the machine that ran
out of space:

ERROR [COMMIT-LOG-WRITER] 2010-10-05 16:05:22,393 CassandraDaemon.java (line 83) Uncaught exception in thread Thread[COMMIT-LOG-WRITER,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: No space left on device
       at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
       at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: java.io.IOException: No space left on device
       at org.apache.cassandra.db.commitlog.BatchCommitLogExecutorService.processWithSyncBatch(BatchCommitLogExecutorService.java:102)
       at org.apache.cassandra.db.commitlog.BatchCommitLogExecutorService.access$000(BatchCommitLogExecutorService.java:31)
       at org.apache.cassandra.db.commitlog.BatchCommitLogExecutorService$1.runMayThrow(BatchCommitLogExecutorService.java:49)
       at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
       ... 1 more
Caused by: java.io.IOException: No space left on device
       at java.io.RandomAccessFile.writeBytes(Native Method)
       at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.flushBuffer(BufferedRandomAccessFile.java:193)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.sync(BufferedRandomAccessFile.java:173)
       at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:142)
       at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:424)
       at org.apache.cassandra.db.commitlog.BatchCommitLogExecutorService.processWithSyncBatch(BatchCommitLogExecutorService.java:98)
       ... 4 more
ERROR [COMPACTION-POOL:1] 2010-10-05 16:05:40,366 CassandraDaemon.java (line 83) Uncaught exception in thread Thread[COMPACTION-POOL:1,5,main]
java.util.concurrent.ExecutionException: java.io.IOException: No space left on device
       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
       at java.util.concurrent.FutureTask.get(FutureTask.java:83)
       at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
       at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:577)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
       at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No space left on device
       at java.io.RandomAccessFile.writeBytes(Native Method)
       at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.flushBuffer(BufferedRandomAccessFile.java:193)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:239)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:390)
       at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:366)
       at org.apache.cassandra.io.SSTableWriter.append(SSTableWriter.java:100)
       at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:300)
       at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
       at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:83)
       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
       ... 2 more


On Tue, Oct 5, 2010 at 3:43 PM, Aaron Morton <aaron@thelastpickle.com> wrote:
> The cluster looks unbalanced (assuming the Random Partitioner). Did you
> manually assign tokens to the nodes? The section on Token Selection here
> has some tips: http://wiki.apache.org/cassandra/Operations
> One of the nodes in the cluster is down. Is there anything in the log to
> explain why? You may have some other errors.
> Also want to check:
> - your client has a list of all of the nodes, so it could move to another
> if it was connected to the down node.
> - what's the RF, and what consistency level are you writing at?
> - how long is the hang?
> - what's happening on the server while the client is hanging? e.g. is it
> idle, or is the CPU going crazy, swapping? Check iostat.
> - what timeout are you using with Thrift?
>
> Aaron
> On 06 Oct, 2010, at 07:28 AM, Jason Horman <jhorman@gmail.com> wrote:
>
> We are experiencing some random hangs while importing data into
> Cassandra 0.6.5. The client stack dump is below. We are using Java
> Pelops with Thrift r917130. The hang seems random, sometimes millions
> of records in, sometimes just a few thousand. It sort of smells like
> this JIRA:
>
> https://issues.apache.org/jira/browse/CASSANDRA-1175
>
> Has anyone else experienced this? Any advice?
>
> Here is a dump from nodetool
>
> Address          Status  Load      Range                                      Ring
> 10.192.230.224   Down    43.41 GB  25274261893111669883290654807978388961    |<--|
> 10.248.135.223   Up      29.38 GB  34662916595519283353151730886201323030    |   ^
> 10.209.125.235   Up      19.83 GB  45387569059876439228162547977665761954    v   |
> 10.206.209.112   Up      23.59 GB  105389616365686887162471812716889564402   |   ^
> 10.209.22.3      Up      33.16 GB  148562884084359545011181864444489491335   |-->|
>
> Here is the stack
>
> "RMI TCP Connection(4)-10.246.55223" daemon prio=10
> tid=0x00002aaac0194000 nid=0x53b3 runnable [0x000000004b7dc000]
>    java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> - locked <0x000000074d23e978> (a java.io.BufferedInputStream)
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
> org.apache.thrifttransport.TFramedTransport.readFrame(TFramedTransport.java:92)
> at
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:85)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
> at
> org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:794)
> at
> org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:777)
> at org.wyki.cassandra.pelops.Mutator$1.execute(Mutator.java:40)
>



-- 
-jason
