Date: Mon, 20 Jun 2011 17:41:06 -0700
Subject: Re: Problem with PropertyFileSnitch in Amazon EC2
From: Sameer Farooqui
To: user@cassandra.apache.org

Quick update...

I'm trying to get a 3-node cluster defined the following way in the topology.properties file to work first:

10.68.x.x=DC1:RAC1
10.198.x.x=DC1:RAC2
10.204.x.x=DC1:RAC3

I'll split up the 3rd node into a separate data center later.

Also, ignore that comment I made about the $BRISK_HOME/lib/ folder not existing. When you run ANT, I believe it populates correctly, but I'll have to confirm/test later.

Based on Joaquin @ DataStax's suggestion, I tried changing the Seed IP in all 3 nodes' YAML file to the Amazon Private IP, instead of the Elastic IP.
After this change, all three nodes joined the ring correctly:

ubuntu@ip-10-68-x-x:~/brisk-1.0~beta1.2/resources/cassandra/conf$ ../bin/nodetool -h localhost ring
Address         Status State   Load            Owns    Token
                                                       113427455640312821154458202477256070485
10.68.x.x       Up     Normal  10.9 KB         33.33%  0
10.198.x.x      Up     Normal  15.21 KB        33.33%  56713727820156410577229101238628035242
10.204.x.x      Up     Normal  6.55 KB         33.33%  113427455640312821154458202477256070485

PasteBin is down and is showing me a diligent cat typing on a keyboard, so I uploaded some relevant DEBUG level log files here:

http://blueplastic.com/accenture/N1-system-seed_is_ElasticIP.log (problem exists)
http://blueplastic.com/accenture/N2-system-seed_is_ElasticIP.log (problem exists)
http://blueplastic.com/accenture/N1-system-seed_is_privateIP.log (everything works)
http://blueplastic.com/accenture/N2-system-seed_is_privateIP.log (everything works)

But if I want to set up the Brisk cluster across Amazon regions, I have to be able to use the Elastic IP for the seed. Also, using v 0.7.4 of Cassandra in Amazon, we successfully set up a 30+ node cluster using 3 seed nodes which were declared in the YAML file using Elastic IPs. All 30 nodes were in the same region and availability zone. So, in an older version of Cassandra, providing the Seeds as Elastic IP used to work.

In my current setup, even though nodes 1 & 2 are in the same region & availability zone, I can't seem to get them to join the same ring correctly.

Here is what the system log file shows when I declare the Seed using Elastic IP:

INFO [Thread-4] 2011-06-21 00:10:30,849 BriskDaemon.java (line 187) Listening for thrift clients...
DEBUG [GossipTasks:1] 2011-06-21 00:10:31,608 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
DEBUG [WRITE-/50.17.212.84] 2011-06-21 00:10:31,610 OutboundTcpConnection.java (line 161) attempting to connect to /50.17.x.x
DEBUG [GossipTasks:1] 2011-06-21 00:10:32,610 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
DEBUG [ScheduledTasks:1] 2011-06-21 00:10:32,613 StorageLoadBalancer.java (line 334) Disseminating load info ...
DEBUG [GossipTasks:1] 2011-06-21 00:10:33,611 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x
DEBUG [GossipTasks:1] 2011-06-21 00:10:34,612 Gossiper.java (line 201) Assuming current protocol version for /50.17.x.x

But when I use the private IP, the log shows:

INFO [Thread-4] 2011-06-21 00:19:47,993 BriskDaemon.java (line 187) Listening for thrift clients...
DEBUG [ScheduledTasks:1] 2011-06-21 00:19:49,769 StorageLoadBalancer.java (line 334) Disseminating load info ...
DEBUG [WRITE-/10.198.126.193] 2011-06-21 00:20:09,658 OutboundTcpConnection.java (line 161) attempting to connect to /10.198.x.x
INFO [GossipStage:1] 2011-06-21 00:20:09,690 Gossiper.java (line 637) Node /10.198.x.x is now part of the cluster
DEBUG [GossipStage:1] 2011-06-21 00:20:09,691 MessagingService.java (line 158) Resetting pool for /10.198.x.x
INFO [GossipStage:1] 2011-06-21 00:20:09,691 Gossiper.java (line 605) InetAddress /10.198.x.x is now UP
DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 282) Checking remote schema before delivering hints
DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 274) schema for /10.198.x.x matches local schema
DEBUG [HintedHandoff:1] 2011-06-21 00:20:09,692 HintedHandOffManager.java (line 288) Sleeping 11662ms to stagger hint delivery

- Sameer

On Mon, Jun 20, 2011 at 2:28 PM, Sameer Farooqui wrote:

> Hi,
>
> I'm setting up a 3 node test cluster in multiple Amazon Availability Zones
> to test cross-zone internode communication (and eventually cross-region
> communications).
>
> But I wanted to start with a cross-zone setup and am having trouble getting
> the nodes to connect to each other and join one 3-node ring. All nodes just
> seem to join their own ring and claim 100% of that space.
>
> I'm using this Beta2 distribution of Brisk:
> http://debian.datastax.com/maverick/pool/brisk_1.0~beta1.2.tar.gz
>
> I had to manually recreate the $BRISK_HOME/lib/ folder because it didn't
> exist in the binary for some reason and I also added jna and mx4j jar files
> to the lib directory.
>
> The cluster is geographically located like this:
>
> Node 1 (seed): East-A
> Node 2: East-A
> Node 3: East-B
>
> The cassandra-topology.properties file on all three nodes contains this:
>
> # Cassandra Node IP=Data Center:Rack
> 10.68.x.x=DC1:RAC1
> 10.198.x.x=DC1:RAC2
> 10.204.x.x=DC2:RAC1
> default=DC1:RAC1
>
> and finally, here is what the relevant sections of the YAML file look like
> for each node:
>
> ++ Node 1 ++
> cluster_name: 'Test Cluster'
> initial_token: 0
> auto_bootstrap: false
> partitioner: org.apache.cassandra.dht.RandomPartitioner
> - seeds: 50.17.x.x    #This is the elastic IP for Node 1
> listen_address: 10.68.x.x
> rpc_address: 0.0.0.0
> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
> encryption_options:
>     internode_encryption: none
>
> ++ Node 2 ++
> cluster_name: 'Test Cluster'
> initial_token: 56713727820156410577229101238628035242
> auto_bootstrap: true
> partitioner: org.apache.cassandra.dht.RandomPartitioner
> - seeds: 50.17.x.x    #This is the elastic IP for Node 1
> listen_address: 10.198.x.x
> rpc_address: 0.0.0.0
> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
> encryption_options:
>     internode_encryption: none
>
> ++ Node 3 ++
> cluster_name: 'Test Cluster'
> initial_token: 113427455640312821154458202477256070485
> auto_bootstrap: true
> partitioner: org.apache.cassandra.dht.RandomPartitioner
> - seeds: 50.17.x.x    #This is the elastic IP for Node 1
> listen_address: 10.204.x.x
> rpc_address: 0.0.0.0
> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
> encryption_options:
>     internode_encryption: none
>
> When I start Cassandra on all three nodes using "sudo bin/brisk cassandra",
> the startup log doesn't show any warnings or errors. The end of the startup
> log on Node 1 says:
>
> INFO [main] 2011-06-20 21:06:57,702 MessagingService.java (line 201) Starting Messaging Service on port 7000
> INFO [main] 2011-06-20 21:06:57,723 StorageService.java (line 482) Using saved token 0
> INFO [main] 2011-06-20 21:06:57,724 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@1260987126(38/47 serialized/live bytes, 2 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:57,724 Memtable.java (line 237) Writing Memtable-LocationInfo@1260987126(38/47 serialized/live bytes, 2 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:57,809 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-12-Data.db (148 bytes)
> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,812 CompactionManager.java (line 539) Compacting Major: [SSTableReader(path='/raiddrive/data/system/LocationInfo-g-9-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-11-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-10-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-12-Data.db')]
> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,828 CompactionIterator.java (line 186) Major@1110828771(system, LocationInfo, 429/808) now compacting at 16777 bytes/ms.
> INFO [main] 2011-06-20 21:06:57,881 Mx4jTool.java (line 67) mx4j successfuly loaded
> INFO [CompactionExecutor:2] 2011-06-20 21:06:57,909 CompactionManager.java (line 603) Compacted to /raiddrive/data/system/LocationInfo-tmp-g-13-Data.db. 808 to 432 (~53% of original) bytes for 3 keys. Time: 97ms.
> INFO [main] 2011-06-20 21:06:57,953 BriskDaemon.java (line 146) Binding thrift service to /0.0.0.0:9160
> INFO [main] 2011-06-20 21:06:57,955 BriskDaemon.java (line 160) Using TFastFramedTransport with a max frame size of 15728640 bytes.
> INFO [Thread-4] 2011-06-20 21:06:57,958 BriskDaemon.java (line 187) Listening for thrift clients...
>
> And the end of the log on node 2 says:
>
> INFO [main] 2011-06-20 21:06:57,899 StorageService.java (line 368) Cassandra version: 0.8.0-beta2-SNAPSHOT
> INFO [main] 2011-06-20 21:06:57,901 StorageService.java (line 369) Thrift API version: 19.10.0
> INFO [main] 2011-06-20 21:06:57,901 StorageService.java (line 382) Loading persisted ring state
> INFO [main] 2011-06-20 21:06:57,904 StorageService.java (line 418) Starting up server gossip
> INFO [main] 2011-06-20 21:06:57,915 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@885597447(29/36 serialized/live bytes, 1 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:57,916 Memtable.java (line 237) Writing Memtable-LocationInfo@885597447(29/36 serialized/live bytes, 1 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:57,990 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-8-Data.db (80 bytes)
> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,000 CompactionManager.java (line 539) Compacting Major: [SSTableReader(path='/raiddrive/data/system/LocationInfo-g-6-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-8-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-7-Data.db'), SSTableReader(path='/raiddrive/data/system/LocationInfo-g-5-Data.db')]
> INFO [main] 2011-06-20 21:06:58,007 MessagingService.java (line 201) Starting Messaging Service on port 7000
> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,015 CompactionIterator.java (line 186) Major@291813814(system, LocationInfo, 467/770) now compacting at 16777 bytes/ms.
> INFO [main] 2011-06-20 21:06:58,032 StorageService.java (line 482) Using saved token 56713727820156410577229101238628035242
> INFO [main] 2011-06-20 21:06:58,033 ColumnFamilyStore.java (line 1011) Enqueuing flush of Memtable-LocationInfo@934909150(53/66 serialized/live bytes, 2 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:58,033 Memtable.java (line 237) Writing Memtable-LocationInfo@934909150(53/66 serialized/live bytes, 2 ops)
> INFO [FlushWriter:1] 2011-06-20 21:06:58,157 Memtable.java (line 254) Completed flushing /raiddrive/data/system/LocationInfo-g-10-Data.db (163 bytes)
> INFO [CompactionExecutor:1] 2011-06-20 21:06:58,169 CompactionManager.java (line 603) Compacted to /raiddrive/data/system/LocationInfo-tmp-g-9-Data.db. 770 to 447 (~58% of original) bytes for 3 keys. Time: 168ms.
> INFO [main] 2011-06-20 21:06:58,206 Mx4jTool.java (line 67) mx4j successfuly loaded
> INFO [main] 2011-06-20 21:06:58,249 BriskDaemon.java (line 146) Binding thrift service to /0.0.0.0:9160
> INFO [main] 2011-06-20 21:06:58,252 BriskDaemon.java (line 160) Using TFastFramedTransport with a max frame size of 15728640 bytes.
> INFO [Thread-4] 2011-06-20 21:06:58,254 BriskDaemon.java (line 187) Listening for thrift clients...
>
> Running nodetool ring on Node 1 shows:
>
> ubuntu@ip-10-68-x-x:~/brisk-1.0~beta1.2/resources/cassandra$ bin/nodetool -h localhost ring
> Address         Status State   Load            Owns    Token
> 10.68.x.x       Up     Normal  10.9 KB         100.00% 0
>
> nodetool ring on Node 2 shows:
>
> ubuntu@domU-12-31-39-10-x-x:~/brisk-1.0~beta1.2/resources/cassandra$ bin/nodetool -h localhost ring
> Address         Status State   Load            Owns    Token
> 10.198.x.x      Up     Normal  15.21 KB        100.00% 56713727820156410577229101238628035242
>
> I have also tried placing all three nodes in the same data center, like
> this, with no luck:
>
> 10.68.x.x=DC1:RAC1
> 10.198.x.x=DC1:RAC2
> 10.204.x.x=DC1:RAC3
>
> After the above change, each node still joins its own ring and claims 100%
> of it. Here are the full startup logs for when just one data center is
> specified in the topology.properties file:
>
> Node 1: http://pastebin.com/Vzy2u9WB
> Node 2: http://pastebin.com/rqGy5Asy
>
> On a side note, I have also tried switching the snitch in the YAML file on
> all three nodes to BriskSimpleSnitch. The problem persists: the nodes still
> don't join the same ring and show the same symptoms. So, I'm guessing the
> problem is not necessarily the snitch, but something else?
>
> I can ping all three nodes from each other and the following ports are open
> between the nodes: ICMP, TCP 1024-65535, 7000, 7199, 8012, 8888
>
> Questions:
>
> 1) What am I doing wrong that's preventing the nodes from seeing each other
> and joining 1 ring? What should I look at more closely to troubleshoot this?
>
> 2) Would it help to troubleshoot this if I turn on DEBUG logging for
> Cassandra and then restart the "bin/brisk cassandra" service?

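A side note for anyone reproducing this setup: the three initial_token values used in this thread (0, 56713727820156410577229101238628035242 and 113427455640312821154458202477256070485) are consistent with evenly spaced RandomPartitioner tokens, i.e. i * 2**127 / N for N = 3 nodes. Below is a minimal Python sketch of that arithmetic. It assumes the 0 to 2**127 token range of RandomPartitioner, and `initial_tokens` is a hypothetical helper name chosen for illustration, not part of Cassandra or Brisk.

```python
# Evenly spaced initial tokens, assuming RandomPartitioner's
# token space of [0, 2**127).
# `initial_tokens` is a hypothetical helper, not a Cassandra or Brisk API.


def initial_tokens(node_count):
    """Return one initial_token per node, spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127
    # Integer division keeps the tokens exact (no floating-point rounding).
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(initial_tokens(3), start=1):
        print("Node %d: initial_token: %d" % (node, token))
```

For three nodes this reproduces the values from the YAML sections quoted in this thread, so the token assignment itself looks fine and is probably not what keeps the nodes in separate rings.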