We didn't solve it, unfortunately, and ended up regenerating the entire cluster. But if it helps anyone in the future: we too had multiple keyspaces when we encountered the problem.

On Mon, Nov 8, 2010 at 5:47 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:
I have just solved the problem by removing the second keyspace (manually moving its column families to the first). So it seems the problem appears when having multiple keyspaces.

2010/11/8 Thibaut Britz <thibaut.britz@trendiction.com>

Hi,

No, I didn't solve the problem. I reinitialized the cluster and manually gave each node a token before adding data. There are a few messages in multiple threads related to this, so I suspect it's quite common; I hope it's gone in 0.7.
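
In case it helps, here's a minimal sketch of how evenly spaced tokens can be computed, assuming the RandomPartitioner's token space of 0 to 2**127 (the node count below is just an example):

# Minimal sketch: evenly spaced InitialToken values for an n-node ring,
# assuming the RandomPartitioner's token space of [0, 2**127).
def initial_tokens(n):
    return [i * 2**127 // n for i in range(n)]

for i, t in enumerate(initial_tokens(3)):
    print(f"node {i}: InitialToken = {t}")

Each value then goes into the corresponding node's storage-conf.xml as <InitialToken> before that node first starts.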

Thibaut

On Sun, Nov 7, 2010 at 6:57 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:
Hi,

Did you solve this problem? I'm having the same problem. I'm trying to bootstrap a third node into a 0.6.6 cluster. It has two keyspaces, Keyspace1 and KeyspaceLogs, both with replication factor 2.

It starts bootstrapping and receives some streams, but then it just keeps waiting for more. I enabled debug mode; these lines may be useful:

DEBUG [main] 2010-11-07 17:39:50,052 BootStrapper.java (line 70) Beginning bootstrap process
DEBUG [main] 2010-11-07 17:39:50,082 StorageService.java (line 160) Added /10.204.93.16/Keyspace1 as a bootstrap source
...
DEBUG [main] 2010-11-07 17:39:50,090 StorageService.java (line 160) Added /10.204.93.16/KeyspaceLogs as a bootstrap source
... (streaming messages)
DEBUG [Thread-56] 2010-11-07 17:45:51,706 StorageService.java (line 171) Removed /10.204.93.16/Keyspace1 as a bootstrap source; remaining is [/10.204.93.16]
...
(and never ends).

It seems the node is now waiting on [/10.204.93.16] when it should still be waiting on /10.204.93.16/KeyspaceLogs.
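
To illustrate what I mean (this is only a sketch of the bookkeeping I would expect, not Cassandra's actual StorageService code): bootstrap should finish only once every host/keyspace pair has been removed, but the "remaining" list above seems to track just the bare host:

from collections import defaultdict

# Sketch only (not Cassandra's real code): bootstrap sources tracked
# per (host, keyspace) pair; the node keeps waiting until every
# keyspace for that host has finished streaming.
sources = defaultdict(set)
sources["/10.204.93.16"].update(["Keyspace1", "KeyspaceLogs"])

def remove_source(host, keyspace):
    sources[host].discard(keyspace)
    if not sources[host]:  # only now has the host fully streamed
        del sources[host]

remove_source("/10.204.93.16", "Keyspace1")
print(dict(sources))  # {'/10.204.93.16': {'KeyspaceLogs'}} -- still waiting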

The third node is 64-bit, while the two existing nodes are 32-bit. Could this be a problem?

Thank you.


2010/10/28 Dimitry Lvovsky <dimitry@reviewpro.com>

Maybe your <StoragePort>7000</StoragePort> is being blocked by iptables or some other firewall, or maybe you have it bound (the <ListenAddress> tag) to localhost instead of an IP address.
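
A quick way to test the firewall possibility is to try opening a TCP connection to the storage port from another machine; a rough sketch (the host here is only an example, substitute one of your nodes):

import socket

# Rough reachability check: can we open a TCP connection to a node's
# storage port from another machine? Host and port are example values.
def storage_port_open(host, port=7000, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(storage_port_open("192.168.2.22"))  # False suggests a firewall or a bad ListenAddress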

Hope this helps,
Dimitry.



On Thu, Oct 28, 2010 at 5:35 PM, Thibaut Britz <thibaut.britz@trendiction.com> wrote:
Hi,

I have the same problem with 0.6.5

New nodes hang forever in bootstrap mode (no streams are being opened), and the receiver thread just waits for data:


 INFO [Thread-53] 2010-10-27 20:33:37,399 SSTableReader.java (line 120) Sampling index for /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
 INFO [Thread-53] 2010-10-27 20:33:37,444 StreamCompletionHandler.java (line 64) Streaming added /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db

Stack trace:

"pool-1-thread-53" prio=10 tid=0x00000000412f2800 nid=0x215c runnable [0x00007fd7cf217000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        - locked <0x00007fd7e77e0520> (a java.io.BufferedInputStream)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1154)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <aaron@thelastpickle.com> wrote:
The best approach is to manually select the tokens; see the Load Balancing section of http://wiki.apache.org/cassandra/Operations

Also:

Are there any log messages in the existing nodes or the new one which mention each other? 

Is this a production system? Is it still running?

Sorry, there is not a lot to go on; it sounds like you've done the right thing. I'm assuming things like the Cluster Name, seed list, and port numbers are set correctly, since the new node got some data.

You'll need to dig through the logs a bit more to see whether the bootstrapping started and what the last message it logged was.

Good Luck. 
Aaron

On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:

Hi Aaron,
Thanks for your reply.

We still haven't solved this, unfortunately.

How did you start the bootstrap for the .18 node?

Standard way: we set "AutoBootstrap" to true and added all the servers from the working ring as seeds.  
 
Was it the .18 or the .17 node you tried to add?

We first tried adding .17: it streamed for a while, took on 50GB of load, stopped streaming, but then didn't enter the ring. We left it for a few days to see if it would come in, but no luck. After that we ran decommission and removeToken (in that order).
Since we couldn't get .17 in, we tried again with .18. Before doing so we increased RpcTimeoutInMillis from 1000 to 10000, having read that a low timeout may cause nodes not to enter the ring. It's been going since Friday and, like .17, still won't come into the ring.

Does it have a token in the config or did you use nodetool move to set it?
No, we didn't manually set the token in the config; rather, we were relying on the token being assigned during bootstrap by the RandomPartitioner.

Again thanks for the help.

Dimitry.
  
On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <aaron@thelastpickle.com> wrote:
Dimitry, did you get anywhere with this?

Was it the .18 or the .17 node you tried to add? How did you start the bootstrap for the .18 node? Does it have a token in the config or did you use nodetool move to set it?

I had a quick look at the code; AFAIK the message about removing the fat client is logged when the node does not have a record of the token the other node has.

Aaron

On 26 Oct 2010, at 10:42 PM, Dimitry Lvovsky <dimitry@reviewpro.com> wrote:

Hi All,
We recently upgraded from 0.6.5 to 0.6.6, after which we tried adding a new node to our cluster. We left it bootstrapping, and after 3 days it still refused to join the ring. The strange thing is that nodetool info shows 50GB of load, and nodetool ring shows that it sees the rest of the ring, which it is not part of. We tried the process again with another server -- again the same thing as before:


//from machine 192.168.218
/opt/cassandra/bin/nodetool -h localhost -p 8999 info
131373516047318302934572185119435768941
Load : 52.85 GB
Generation No : 1287761987
Uptime (seconds) : 323157
Heap Memory (MB) : 795.42 / 1945.63


/opt/cassandra/bin/nodetool -h localhost -p 8999 ring
Address       Status  Load      Range                                      Ring
                                158573510920250391466717289405976537674
192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202    |<--|
192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055    |   |
192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065   |   |
192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674   |-->|

/opt/cassandra/bin/nodetool -h localhost -p 8999 streams
Mode: Bootstrapping
Not sending any streams.
Not receiving any streams.


What's more, while looking at the log of one of the nodes, I see gossip messages from 192.168.2.17 -- the first node we tried to add to the cluster, but which is not running at the time of the log messages:
INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient /192.168.2.17 has been silent for 3600000ms, removing from gossip
INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /192.168.2.17 is now part of the cluster


Thanks in advance for the help,
Dimitry



--
Dimitry Lvovsky
Director of Engineering
ReviewPro
www.reviewpro.com
+34 616 337 103