cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CASSANDRA-1463) Failed bootstrap can cause NPE in batch_mutate on every node, taking down the entire cluster
Date Fri, 03 Sep 2010 20:56:33 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906087#action_12906087 ]

Brandon Williams commented on CASSANDRA-1463:
---------------------------------------------

Fixed in 0.7 by CASSANDRA-757, but the approach we took for 0.6 was CASSANDRA-1289.  My guess
is that so many batch_mutate errors were being logged that logging consumed all the CPU before
the gossiper timer could run again; that next run would have cleared the problem.  I'm not sure
how to solve this in 0.6 in a less invasive way than the 0.7 approach.

> Failed bootstrap can cause NPE in batch_mutate on every node, taking down the entire cluster
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1463
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1463
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.5
>            Reporter: David King
>
> In adding a node to the cluster, the bootstrap failed (still investigating the cause).
> An hour later, the entire cluster failed, preventing any writes from being accepted. This
> exception started being printed to the logs:
> {quote}
>  INFO [Timer-0] 2010-09-03 12:23:33,282 Gossiper.java (line 402) FatClient /10.251.243.191 has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-03 12:23:33,318 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>         at java.util.Hashtable$Enumerator.next(Hashtable.java:1048)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>         at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
> ERROR [pool-1-thread-69153] 2010-09-03 12:23:33,857 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
>         at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
>         at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
>         at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
>         at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> ERROR [pool-1-thread-69154] 2010-09-03 12:23:33,869 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
>         at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
>         at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
>         at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
>         at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> {quote}
> After a large number of iterations of that (at least thousands), the printed exception
> was shortened (this shortening is what made me mistakenly file #1462) to
> {quote}
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,857 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,883 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,894 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68970] 2010-09-03 12:39:22,985 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68970] 2010-09-03 12:39:23,084 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> {quote}
> Rolling a restart over the cluster fixed it, but every node had to be restarted before
> it started accepting writes again.
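
The "Gossip error" at the top of the first trace is a ConcurrentModificationException thrown from
Hashtable$Enumerator.next, logged right after the FatClient removal message. As a minimal sketch of
that general failure mode (not the actual Gossiper code, which is not shown in this report), the
Java below structurally modifies a Hashtable while iterating over it and contrasts that with
removal through the iterator. The class name, method names, and endpoint addresses are all
hypothetical and chosen only for illustration.

```java
import java.util.ConcurrentModificationException;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Cassandra.
final class FatClientRemovalSketch {

    // Stand-in for an endpoint-state table keyed by address (made-up addresses).
    static Map<String, Long> sampleState() {
        Map<String, Long> state = new Hashtable<>();
        state.put("10.0.0.1", 0L);
        state.put("10.0.0.2", 0L);
        state.put("10.0.0.3", 0L);
        return state;
    }

    // Removes entries directly from the map while a fail-fast iterator is active.
    // Returns true if the iterator detected the modification and threw.
    static boolean unsafeRemoveThrows() {
        Map<String, Long> state = sampleState();
        try {
            for (String endpoint : state.keySet()) {
                state.remove(endpoint); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removes entries through the iterator itself, which keeps its bookkeeping
    // consistent. Returns the number of entries left afterwards.
    static int safeRemoveLeaves() {
        Map<String, Long> state = sampleState();
        Iterator<String> it = state.keySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        return state.size();
    }

    public static void main(String[] args) {
        System.out.println("direct removal threw: " + unsafeRemoveThrows());
        System.out.println("entries left after iterator removal: " + safeRemoveLeaves());
    }
}
```

Hashtable synchronizes individual calls, but that does not protect a whole iteration: any
structural change made outside the iterator, from the same thread or another one, makes the next
call to next() fail. Whether the status check in 0.6.5 hit this from one thread or two cannot be
determined from the trace alone.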

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

