cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gary Dusbabek (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (CASSANDRA-1467) replication factor exceeds number of endpoints, when attempting to join a new node (but cluster has enough running nodes to fulfill RF)
Date Thu, 09 Sep 2010 18:48:32 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907718#action_12907718
] 

Gary Dusbabek edited comment on CASSANDRA-1467 at 9/9/10 2:46 PM:
------------------------------------------------------------------

The first two patches are housekeeping.  Exposing the endpoint states via jmx was helpful
for troubleshooting.
The third patch changes gossip so that the removetoken information is broadcast as part of
a MOVE,NORMAL instead of MOVE,LEFT.  This means that we no longer have to distinguish between
MOVE,LEFT,leaving and MOVE,LEFT,removetoken, so I removed that distinction.  

Valid MOVE operations are now these:
MOVE,NORMAL,token[,<other command>,<other token>]
MOVE,LEFT,token
MOVE,BOOTSTRAPPING,token
MOVE,LEAVING,token

If this patch goes into 0.6 as is, it alters gossip enough so that if a removetoken, decommission
or loadbalance are in progress during the rolling upgrade, bad things might happen (the wrong
handlers could be called).

      was (Author: gdusbabek):
    The first two patches are housekeeping.  Exposing the endpoint states via jmx was helpful
for troubleshooting.
The third page changes gossip so that the removetoken information is broadcast as part of
a MOVE,NORMAL instead of MOVE,LEFT.  This means that we no longer have to distinguish between
MOVE,LEFT,leaving and MOVE,LEFT,removetoken, so I removed that distinction.  

Valid MOVE operations are now these:
MOVE,NORMAL,token[,<other command>,<other token>]
MOVE,LEFT,token
MOVE,BOOTSTRAPPING,token
MOVE,LEAVING,token

If this patch goes into 0.6 as is, it alters gossip enough so that if a removetoken, decommission
or loadbalance are in progress during the rolling upgrade, bad things might happen (the wrong
handlers could be called).
  
> replication factor exceeds number of endpoints, when attempting to join a new node (but
cluster has enough running nodes to fulfill RF)
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1467
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1467
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>         Environment: I'm using 0.7 beta1. Built from: http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.7.0/apache-cassandra-0.7.0-beta1-src.tar.gz
 using dpkg-buildpackage from it.
>            Reporter: Josep M. Blanquer
>            Assignee: Gary Dusbabek
>             Fix For: 0.6.6, 0.7 beta 2
>
>         Attachments: 1467-0.6.txt, v0-0001-expose-endpoint-states-to-jmx.txt, v0-0002-remove-unused-code-from-SLB.txt,
v0-0003-broadcast-removetoken-using-MOVE-NORMAL-to-preserve-th.txt
>
>
> What happens here the following:
> * given a healthy running cluster of 2 nodes (it used to be 3 but I manually killed one
down)
> * with a Keyspace having a ReplicationFactor of 2
> * (this cluster is operational, loaded with data and working well)
>  as soon as I want to bring up a new 3rd node:
> * the node is detected by the current cluster
> * but as soon as it tries to initiate bootstrap sequence it dies with:  replication factor
(2) exceeds number of endpoints (1)
> I believe this is similar to #1343, but not quite the same, since the keyspace and everything
is already created, and I'm not attempting any modification. This is purely bringing a new
node up.
> One extra tidbit of information, in case it's important...maybe it is not: 
> I have only set 1 seed node configured (this is a test setup). And when the new node
starts coming up (before the crash) its nodetool ring only reports the seed one.
> From the new node coming up (before it crashes):
> root@node1-3:~# nodetool -h localhost -p 8080 ring
> Address         Status State   Load            Token                                
      
> 10.250.106.111  Up     Normal  116.3 GB        0          < = this the only configured
seed node
> From one of the other running nodes (this is the real status of the cluster):
> root@node1-2:~# nodetool -h localhost -p 8080 ring
> Address         Status State   Load            Token                                
      
>                                        113416112894748789872342756657008344878    
> 10.250.106.111  Up     Normal  116.3 GB        0                                    
      
> 10.215.195.81   Up     Normal  116.31 GB       56713727820156410577229101238628035242
     
> 10.246.65.221   Up     Joining 7.63 KB         113416112894748789872342756657008344878
    
> Here's the full boot trace of the node is trying to join:
> Java HotSpot(TM) Client VM warning: Can't detect initial thread stack location - find_vma
failed
> Create RMI registry on port 8081
> Get the platform's MBean server
> Initialize the environment map
> Create an RMI connector server
> Start the RMI connector server on port 8081
>  INFO 22:32:50,448 Loading settings from /etc/cassandra/cassandra.yaml
> DEBUG 22:32:50,537 Syncing log with a period of 10000
>  INFO 22:32:50,538 DiskAccessMode 'auto' determined to be standard, indexAccessMode is
standard
> DEBUG 22:32:50,548 setting auto_bootstrap to true
> DEBUG 22:32:50,702 Starting CFS Statistics
> DEBUG 22:32:50,710 key cache capacity for Statistics is 1
> DEBUG 22:32:50,711 Starting CFS Schema
> DEBUG 22:32:50,712 key cache capacity for Schema is 1
> DEBUG 22:32:50,712 Starting CFS Migrations
> DEBUG 22:32:50,713 key cache capacity for Migrations is 1
> DEBUG 22:32:50,713 Starting CFS LocationInfo
> DEBUG 22:32:50,713 key cache capacity for LocationInfo is 1
> DEBUG 22:32:50,714 Starting CFS HintsColumnFamily
> DEBUG 22:32:50,714 key cache capacity for HintsColumnFamily is 1
>  INFO 22:32:50,736 Couldn't detect any schema definitions in local storage.
>  INFO 22:32:50,737 Found table data in data directories. Consider using JMX to call org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
> DEBUG 22:32:50,738 opening keyspace system
>  INFO 22:32:50,757 Cassandra version: 
>  INFO 22:32:50,757 Thrift API version: 10.0.0
>  INFO 22:32:50,758 Saved Token not found. Using 113416112894748789872342756657008344878
>  INFO 22:32:50,758 Saved ClusterName not found. Using SOMETHING 
>  INFO 22:32:50,763 Creating new commitlog segment /mnt/cassandra/commitlog/CommitLog-1283553170763.log
> DEBUG 22:32:50,767 Estimating compactions for LocationInfo
> DEBUG 22:32:50,768 Estimating compactions for HintsColumnFamily
> DEBUG 22:32:50,768 Estimating compactions for Migrations
> DEBUG 22:32:50,768 Estimating compactions for Schema
> DEBUG 22:32:50,768 Estimating compactions for Statistics
> DEBUG 22:32:50,769 Checking to see if compaction of LocationInfo would be useful
> DEBUG 22:32:50,769 Checking to see if compaction of HintsColumnFamily would be useful
> DEBUG 22:32:50,769 Checking to see if compaction of Migrations would be useful
> DEBUG 22:32:50,769 Checking to see if compaction of Schema would be useful
> DEBUG 22:32:50,769 Checking to see if compaction of Statistics would be useful
>  INFO 22:32:50,779 switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=276)
>  INFO 22:32:50,782 Enqueuing flush of Memtable-LocationInfo@18721294(192 bytes, 4 operations)
>  INFO 22:32:50,783 Writing Memtable-LocationInfo@18721294(192 bytes, 4 operations)
>  INFO 22:32:50,897 Completed flushing /mnt/ebs/data/system/LocationInfo-e-1-Data.db
> DEBUG 22:32:50,898 Checking to see if compaction of LocationInfo would be useful
> DEBUG 22:32:50,898 Discarding 0
> DEBUG 22:32:50,899 discard completed log segments for CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=276), column family 0.
> DEBUG 22:32:50,899 Marking replay position 276 on commit log CommitLogSegment(/mnt/cassandra/commitlog/CommitLog-1283553170763.log)
>  INFO 22:32:50,908 Starting up server gossip
>  INFO 22:32:50,928 Joining: getting load information
>  INFO 22:32:50,928 Sleeping 90000 ms to wait for load information...
> DEBUG 22:32:50,940 attempting to connect to node1-1.domain.com/10.250.106.111
> DEBUG 22:32:51,044 attempting to connect to node1-1.domain.com/10.250.106.111
> DEBUG 22:32:51,045 attempting to connect to /10.215.195.81
>  INFO 22:32:51,051 Node /10.215.195.81 is now part of the cluster
> DEBUG 22:32:51,051 Resetting pool for /10.215.195.81
> DEBUG 22:32:51,052 Token 113416112894748789872342756657008344877 removed manually (endpoint
was unknown)
>  INFO 22:32:51,052 Node /10.250.106.111 is now part of the cluster
> DEBUG 22:32:51,053 Resetting pool for /10.250.106.111
> DEBUG 22:32:51,053 Node /10.250.106.111 state normal, token 0
> DEBUG 22:32:51,053 clearing cached endpoints
> DEBUG 22:32:51,171 Applying AddKeyspace from /10.250.106.111
> DEBUG 22:32:51,188 Applying migration 77e97d6c-b625-11df-8596-318df8b646e8
>  INFO 22:32:51,189 switching in a fresh Memtable for Migrations at CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=7417)
>  INFO 22:32:51,189 Enqueuing flush of Memtable-Migrations@18093512(4938 bytes, 1 operations)
>  INFO 22:32:51,189 Writing Memtable-Migrations@18093512(4938 bytes, 1 operations)
>  INFO 22:32:51,194 switching in a fresh Memtable for Schema at CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=7417)
>  INFO 22:32:51,194 Enqueuing flush of Memtable-Schema@27402470(1784 bytes, 3 operations)
>  INFO 22:32:51,274 Completed flushing /mnt/ebs/data/system/Migrations-e-1-Data.db
> DEBUG 22:32:51,274 Checking to see if compaction of Migrations would be useful
> DEBUG 22:32:51,275 Discarding 2
> DEBUG 22:32:51,275 discard completed log segments for CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=7417), column family 2
> .
> DEBUG 22:32:51,275 Marking replay position 7417 on commit log CommitLogSegment(/mnt/cassandra/commitlog/CommitLog-1283553170763.log)
>  INFO 22:32:51,275 Writing Memtable-Schema@27402470(1784 bytes, 3 operations)
>  INFO 22:32:51,388 Completed flushing /mnt/ebs/data/system/Schema-e-1-Data.db
> DEBUG 22:32:51,389 Checking to see if compaction of Schema would be useful
> DEBUG 22:32:51,389 Discarding 3
> DEBUG 22:32:51,389 discard completed log segments for CommitLogContext(file='/mnt/cassandra/commitlog/CommitLog-1283553170763.log',
position=7417), column family 3
> .
> DEBUG 22:32:51,389 Marking replay position 7417 on commit log CommitLogSegment(/mnt/cassandra/commitlog/CommitLog-1283553170763.log)
> DEBUG 22:32:51,406 Starting CFS MyColumnFamily
> DEBUG 22:32:51,406 key cache capacity for MyColumnFamily is 200000
>  INFO 22:32:51,407 Creating new commitlog segment /mnt/cassandra/commitlog/CommitLog-1283553171407.log
> DEBUG 22:32:51,604 attempting to connect to node1-1.domain.com/10.250.106.111
>  INFO 22:32:51,622 InetAddress /10.215.195.81 is now UP
>  INFO 22:32:51,623 InetAddress /10.250.106.111 is now UP
>  INFO 22:32:51,623 Started hinted handoff for endpoint /10.215.195.81
>  INFO 22:32:51,631 Finished hinted handoff of 0 rows to endpoint /10.215.195.81
>  INFO 22:32:51,632 Started hinted handoff for endpoint /10.250.106.111
>  INFO 22:32:51,632 Finished hinted handoff of 0 rows to endpoint /10.250.106.111
> DEBUG 22:32:51,918 GC for ParNew: 14 ms, 20561696 reclaimed leaving 149257720 used; max
is 902627328
> DEBUG 22:32:51,919 GC for ConcurrentMarkSweep: 73 ms, 4178480 reclaimed leaving 8520856
used; max is 902627328
> DEBUG 22:32:52,924 Disseminating load info ...
> DEBUG 22:32:53,929 attempting to connect to /10.215.195.81
> DEBUG 22:32:54,925 GC for ConcurrentMarkSweep: 73 ms, 212034048 reclaimed leaving 15214608
used; max is 902627328
> DEBUG 22:33:52,931 Disseminating load info ...
> DEBUG 22:34:20,929 ... got load info
>  INFO 22:34:20,929 Joining: getting bootstrap token
> DEBUG 22:34:20,931 token manually specified as 113416112894748789872342756657008344878
>  INFO 22:34:20,932 Joining: sleeping 30000 ms for pending range setup
>  INFO 22:34:50,935 Bootstrapping
> DEBUG 22:34:50,935 Beginning bootstrap process
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:160)
> Caused by: java.lang.IllegalStateException: replication factor (2) exceeds number of
endpoints (1)
>         at org.apache.cassandra.locator.RackUnawareStrategy.calculateNaturalEndpoints(RackUnawareStrategy.java:57)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getRangeAddresses(AbstractReplicationStrategy.java:195)
>         at org.apache.cassandra.dht.BootStrapper.getRangesWithSources(BootStrapper.java:155)
>         at org.apache.cassandra.dht.BootStrapper.startBootstrap(BootStrapper.java:73)
>         at org.apache.cassandra.service.StorageService.startBootstrap(StorageService.java:467)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:408)
>         at org.apache.cassandra.thrift.CassandraDaemon.setup(CassandraDaemon.java:134)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:57)
>         ... 5 more
> Cannot load daemon
> Service exit with a return value of 3

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message