cassandra-user mailing list archives

From Anthony Molinaro <antho...@alumni.caltech.edu>
Subject Re: Large number of ROW-READ-STAGE pending tasks?
Date Fri, 08 Jan 2010 22:57:38 GMT
So it seems to correlate with writes: the machines with pending tasks
in their MESSAGE-SERIALIZER-POOL also have a high number of write counts,
so my keyspace is probably out of balance.  Hopefully the tools available
in 0.5 will let me move keys around to make things a little more
evenly distributed.  I believe they do, but I need to upgrade first :/
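
For the record, the plan is just to compare per-node token ownership with
the ring command (same JMX port as tpstats) and then, if I'm reading the
0.5 changes right, use its new loadbalance/move commands; exact syntax here
is from memory, so treat it as a sketch:

  % cassandra-nodeprobe -host xtr-02.mkt -port 8080 ring

and after upgrading to 0.5, presumably something like:

  % cassandra-nodeprobe -host xtr-07.mkt -port 8080 loadbalance
  (or: cassandra-nodeprobe -host xtr-07.mkt -port 8080 move <new-token>)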

-Anthony

On Fri, Jan 08, 2010 at 02:48:39PM -0800, Anthony Molinaro wrote:
> So I restarted the node with the large number of ROW-READ-STAGE pending
> tasks.  The timeouts are still occurring somewhat randomly, and now the
> MESSAGE-SERIALIZER-POOL seems to be growing on one of the nodes:
> 
> % for h in 02 03 04 05 06 07 08 09 ; do echo "xtr-$h.mkt" ; cassandra-nodeprobe -host xtr-$h.mkt -port 8080 tpstats | grep -v tasks=0 ; done
> xtr-02.mkt
> MINOR-COMPACTION-POOL, pending tasks=5
> xtr-03.mkt
> ROW-MUTATION-STAGE, pending tasks=3
> xtr-04.mkt
> MINOR-COMPACTION-POOL, pending tasks=11
> xtr-05.mkt
> xtr-06.mkt
> xtr-07.mkt
> MESSAGING-SERVICE-POOL, pending tasks=4
> MESSAGE-SERIALIZER-POOL, pending tasks=108
> xtr-08.mkt
> MESSAGING-SERVICE-POOL, pending tasks=2
> MESSAGE-SERIALIZER-POOL, pending tasks=468
> ROW-MUTATION-STAGE, pending tasks=2
> xtr-09.mkt
> ROW-MUTATION-STAGE, pending tasks=1
> 
> ...So I watched for a while, and the MESSAGE-SERIALIZER-POOL count
> seems to go up and down quite a bit on just that one box.  I'm also still
> seeing loads of timeouts on many of the boxes, so it still seems like
> something might be misbehaving.  Also, since the output above, the
> MINOR-COMPACTION-POOL has reached zero on xtr-04 but not on xtr-02;
> does that seem odd?
> 
> -Anthony
> 
> On Fri, Jan 08, 2010 at 03:21:12PM -0600, Jonathan Ellis wrote:
> > If the queued reads are increasing then you're going to OOM eventually,
> > and it will probably freeze (from the clients' perspective) first while
> > it desperately tries to GC enough to continue.  I would restart the
> > affected nodes.
> > 
> > On Fri, Jan 8, 2010 at 3:15 PM, Anthony Molinaro
> > <anthonym@alumni.caltech.edu> wrote:
> > > Hi, I had one of my machines fail last night (OOM), and upon restarting it
> > > about 12 hours later (have to get me some monitoring so I can restart it
> > > faster), I've noticed lots of errors like
> > >
> > > ERROR [pool-1-thread-6915] 2010-01-08 21:10:59,902 Cassandra.java (line 739) Internal error processing multiget_slice
> > > java.lang.RuntimeException: error reading key 3cd4e4ba-2fb6-446a-9dc5-96bd6737dddf
> > >        at org.apache.cassandra.service.StorageProxy.weakReadRemote(StorageProxy.java:265)
> > >        at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:312)
> > >        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:100)
> > >        at org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:182)
> > >        at org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:251)
> > >        at org.apache.cassandra.service.CassandraServer.multiget_slice(CassandraServer.java:228)
> > >        at org.apache.cassandra.service.Cassandra$Processor$multiget_slice.process(Cassandra.java:733)
> > >        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:627)
> > >        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> > >        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >        at java.lang.Thread.run(Thread.java:619)
> > > Caused by: java.util.concurrent.TimeoutException: Operation timed out.
> > >        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
> > >        at org.apache.cassandra.service.StorageProxy.weakReadRemote(StorageProxy.java:261)
> > >        ... 11 more
> > >
> > > on some of the nodes.  Using nodeprobe, I noticed one of the machines
> > > has a large and growing number of pending tasks:
> > >
> > > % cassandra-nodeprobe -host xtr-04.mkt -port 8080 tpstats
> > > FILEUTILS-DELETE-POOL, pending tasks=0
> > > MESSAGING-SERVICE-POOL, pending tasks=0
> > > RESPONSE-STAGE, pending tasks=0
> > > MESSAGE-SERIALIZER-POOL, pending tasks=12
> > > BOOT-STRAPPER, pending tasks=0
> > > ROW-READ-STAGE, pending tasks=2006170
> > > COMMITLOG, pending tasks=8
> > > MESSAGE-DESERIALIZER-POOL, pending tasks=0
> > > GMFD, pending tasks=0
> > > LB-TARGET, pending tasks=0
> > > CONSISTENCY-MANAGER, pending tasks=1
> > > ROW-MUTATION-STAGE, pending tasks=130
> > > MINOR-COMPACTION-POOL, pending tasks=0
> > > MESSAGE-STREAMING-POOL, pending tasks=0
> > > LOAD-BALANCER-STAGE, pending tasks=0
> > > MEMTABLE-FLUSHER-POOL, pending tasks=0
> > >
> > > Does this indicate some sort of impending failure?  Would a restart of the
> > > node or the cluster fix things?  Will it eventually get better, or should
> > > I stop the whole cluster and restart everything (this has worked in the
> > > past, but requires a bit of work to accomplish)?
> > >
> > > This is cassandra 0.4.1 BTW.
> > >
> > > Thanks,
> > >
> > > -Anthony
> > >
> > > --
> > > ------------------------------------------------------------------------
> > > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
> > >
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@alumni.caltech.edu>

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@alumni.caltech.edu>
