incubator-cassandra-user mailing list archives

From Dimitry Lvovsky <dimi...@reviewpro.com>
Subject Re: New nodes won't bootstrap on .66
Date Thu, 28 Oct 2010 16:16:22 GMT
Hey Aaron,

> Are there any log messages in the existing nodes or the new one which
> mention each other?

From the currently running nodes we get the message that the new node is up:
INFO [GMFD:1] 2010-10-22 10:22:49,232 Gossiper.java (line 591) Node /
192.168.2.18 is now part of the cluster
 INFO [GMFD:1] 2010-10-22 10:22:49,616 Gossiper.java (line 583) InetAddress
/192.168.2.18 is now UP
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:22:49,629
HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /
192.168.2.18
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:22:49,677
HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to
endpoint /192.168.2.18
 INFO [STREAM-STAGE:1] 2010-10-22 10:25:12,058 StreamOut.java (line 132)
Sending a stream initiate message to /192.168.2.18 ...
 INFO [STREAM-STAGE:1] 2010-10-22 10:25:12,058 StreamOut.java (line 137)
Waiting for transfer to /192.168.2.18 to complete

Likewise, .18 sees the other nodes, with log messages such as:
INFO [GMFD:1] 2010-10-22 13:45:17,729 Gossiper.java (line 597) Node /
192.168.2.23 has restarted, now UP again
 INFO [GMFD:1] 2010-10-22 14:06:04,342 Gossiper.java (line 597) Node /
192.168.2.20 has restarted, now UP again
 INFO [GMFD:1] 2010-10-22 15:22:08,326 Gossiper.java (line 597) Node /
192.168.2.21 has restarted, now UP again
 INFO [GMFD:1] 2010-10-22 17:31:01,819 Gossiper.java (line 597) Node /
192.168.2.22 has restarted, now UP again

> Is this a production system? Is it still running?


Yep, it is a production system and it's still up.

> You'll need to dig through the logs a bit more to see that the
> bootstrapping started and what was the last message it logged.


Just going to dump some log here in case you see anything that stands out:
.....
INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:19:33,479
HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /
192.168.2.21
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 10:19:33,480
HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to
endpoint /192.168.2.21
 INFO [main] 2010-10-22 10:20:58,700 StorageService.java (line 391) Joining:
getting bootstrap token
 INFO [main] 2010-10-22 10:20:58,756 BootStrapper.java (line 108) New token
will be 131382354792524396082927748799616801083 to assume load from /
192.168.2.21
 INFO [main] 2010-10-22 10:20:58,757 StorageService.java (line 391) Joining:
sleeping 30000 ms for pending range setup
 INFO [main] 2010-10-22 10:21:28,757 StorageService.java (line 391)
Bootstrapping
....
INFO [Thread-84] 2010-10-22 13:42:13,136 SSTableReader.java (line 125)
Sampling index and loading saved keyCache for /var/data....
INFO [Thread-84] 2010-10-22 13:42:13,182 StreamCompletionHandler.java (line
64) Streaming added /var/data/....
....
INFO [SSTABLE-CLEANUP-TIMER] 2010-10-22 13:43:11,568
SSTableDeletingReference.java (line 107) Deleted /var/data/...
INFO [WRITE-/192.168.2.23] 2010-10-22 13:43:38,670
OutboundTcpConnection.java (line 103) error writing to /192.168.2.23
 INFO [Timer-0] 2010-10-22 13:43:40,670 Gossiper.java (line 180) InetAddress
/192.168.2.23 is now dead.
 INFO [GMFD:1] 2010-10-22 13:45:17,729 Gossiper.java (line 597) Node /
192.168.2.23 has restarted, now UP again
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 13:45:17,730
HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /
192.168.2.23
 INFO [GMFD:1] 2010-10-22 13:45:17,730 StorageService.java (line 569) Node /
192.168.2.23 state jump to normal
 INFO [GMFD:1] 2010-10-22 13:45:17,731 StorageService.java (line 576) Will
not change my token ownership to /192.168.2.23
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 13:45:17,879
HintedHandOffManager.java (line 229) Finished hinted handoff of 0 rows to
endpoint /192.168.2.23
 INFO [Timer-0] 2010-10-22 14:04:25,722 Gossiper.java (line 180) InetAddress
/192.168.2.20 is now dead.
 INFO [GMFD:1] 2010-10-22 14:06:04,342 Gossiper.java (line 597) Node /
192.168.2.20 has restarted, now UP again
 INFO [GMFD:1] 2010-10-22 14:06:04,342 StorageService.java (line 569) Node /
192.168.2.20 state jump to normal
 INFO [GMFD:1] 2010-10-22 14:06:04,343 StorageService.java (line 576) Will
not change my token ownership to /192.168.2.20
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 14:06:04,344
HintedHandOffManager.java (line 172) Started hinted handoff for endPoint /
192.168.2.20
 INFO [HINTED-HANDOFF-POOL:1] 2010-10-22 14:06:04,579 ColumnFamilyStore.java
(line 470) HintsColumnFamily has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/var/data/cassandra/commitlog/CommitLog-1287747593217.log',
position=9825007)
....
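
For anyone sifting through logs like these to find the entries that mention a particular node, here is a minimal sketch. It assumes the entries have been hard-wrapped the way they are in this mail, and that each entry begins with a level followed by a bracketed thread name. The helper names `join_wrapped` and `mentioning` are made up for illustration and are not part of Cassandra.

```python
import re

# Assumption: a new entry starts with a log level and a bracketed thread
# name, e.g. "INFO [GMFD:1]". Anything else is treated as a continuation.
ENTRY_START = re.compile(r"^\s*(INFO|WARN|ERROR|DEBUG)\s+\[")

def join_wrapped(lines):
    """Merge hard-wrapped continuation lines back into one string per entry."""
    entries = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if ENTRY_START.match(line):
            entries.append(text)
        elif entries:
            # Continuation of the previous entry (text before the first
            # entry is ignored).
            entries[-1] += " " + text
    return entries

def mentioning(entries, address):
    """Return the entries that mention the given IP address as a whole."""
    # The digit guards keep "192.168.2.2" from matching "192.168.2.23".
    pattern = re.compile(r"(?<!\d)" + re.escape(address) + r"(?!\d)")
    return [e for e in entries if pattern.search(e)]

sample = """\
INFO [GMFD:1] 2010-10-22 10:22:49,232 Gossiper.java (line 591) Node /
192.168.2.18 is now part of the cluster
 INFO [GMFD:1] 2010-10-22 13:45:17,729 Gossiper.java (line 597) Node /
192.168.2.23 has restarted, now UP again
""".splitlines()

for entry in mentioning(join_wrapped(sample), "192.168.2.18"):
    print(entry)
```

Elision markers such as "...." are not log entries, so this sketch simply appends them to the preceding entry; filter them out first if that matters.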


Thanks for the help, and pardon the log file vomit above.

Dimitry.



On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <aaron@thelastpickle.com> wrote:

> The best approach is to manually select the tokens; see the Load Balancing
> section of http://wiki.apache.org/cassandra/Operations. Also:
>
> Are there any log messages in the existing nodes or the new one which
> mention each other?
>
> Is this a production system? Is it still running?
>
> Sorry there is not a lot to go on; it sounds like you've done the right
> thing. I'm assuming things like the Cluster Name, seed list and port numbers
> are set correctly, as the new node got some data.
>
> You'll need to dig through the logs a bit more to see that the
> bootstrapping started and what was the last message it logged.
>
> Good Luck.
> Aaron
>
> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>
> Hi Aaron,
> Thanks for your reply.
>
> We still haven't solved this unfortunately.
>
>> How did you start the bootstrap for the .18 node?
>
>
> Standard way: we set "AutoBootstrap" to true and added all the servers from
> the working ring as seeds.
>
>
>> Was it the .18 or the .17 node you tried to add
>
>
> We first tried adding .17. It streamed for a while, took on 50GB of load,
> and stopped streaming, but then didn't enter the ring. We left it for a few
> days to see if it would come in, but no luck. After that we did
> decommission and removeToken operations (in that order).
> Since we couldn't get .17 in, we tried again with .18. Before doing so we
> increased RpcTimeoutInMillis from 1000 to 10000, having read that this
> timeout may cause nodes not to enter the ring. It's been going
> since Friday and still, like .17, won't come into the ring.
>
>> Does it have a token in the config or did you use nodetool move to set it?
>
> No, we didn't manually set the token in the config; rather, we were relying
> on the token being assigned during bootstrap from the RandomPartitioner.
>
> Again thanks for the help.
>
> Dimitry.
>
>
>
> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <aaron@thelastpickle.com> wrote:
>
>> Dimitry, Did you get anywhere with this ?
>>
>> Was it the .18 or the .17 node you tried to add? How did you start the
>> bootstrap for the .18 node? Does it have a token in the config or did you
>> use nodetool move to set it?
>>
>> I had a quick look at the code. AFAIK the message about removing the fat
>> client is logged when the node does not have a record of the token the other
>> node has.
>>
>> Aaron
>>
>> On 26 Oct 2010, at 10:42 PM, Dimitry Lvovsky <dimitry@reviewpro.com>
>> wrote:
>>
>> Hi All,
>> We recently upgraded from .65 to .66 after which we tried adding a new
>> node to our cluster. We left it bootstrapping and after 3 days, it still
>> refused to join the ring. The strange thing is that nodetool info shows 50GB
>> of load and nodetool ring shows that it sees the rest of ring, which it is
>> not part of. We tried the process again with another server -- again the
>> same thing as before:
>>
>>
>> //from machine 192.168.2.18
>>
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>> 131373516047318302934572185119435768941
>> Load : 52.85 GB
>> Generation No : 1287761987
>> Uptime (seconds) : 323157
>> Heap Memory (MB) : 795.42 / 1945.63
>>
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>> Address       Status  Load      Range                                    Ring
>>                                 158573510920250391466717289405976537674
>> 192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202   |<--|
>> 192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055   |   |
>> 192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065  |   |
>> 192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674  |-->|
>>
>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>> Mode: Bootstrapping
>> Not sending any streams.
>> Not receiving any streams.
>>
>>
>> What's more, while looking at the log of one of the nodes I see gossip
>> messages from 192.168.2.17, the first node we tried to add to the cluster,
>> which was not running at the time of the log message:
>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient
>> /192.168.2.17 has been silent for 3600000ms, removing from gossip
>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /
>> 192.168.2.17 is now part of the cluster
>>
>>
>> Thanks in advance for the help,
>> Dimitry
>>
>>
>
>
> --
> Dimitry Lvovsky
> Director of Engineering
> ReviewPro
> www.reviewpro.com
> +34 616 337 103
>
>
>


-- 
Dimitry Lvovsky
Director of Engineering
ReviewPro
www.reviewpro.com
+34 616 337 103
