incubator-cassandra-user mailing list archives

From Dimitry Lvovsky <dimi...@reviewpro.com>
Subject Re: New nodes won't bootstrap on .66
Date Mon, 08 Nov 2010 19:23:29 GMT
We didn't solve it, unfortunately, and ended up regenerating the entire
cluster.  But if it helps anyone in the future: we too had multiple
keyspaces when we encountered the problem.



On Mon, Nov 8, 2010 at 5:47 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:

> I have just solved the problem by removing the second keyspace (manually
> moving its column families to the first). So it seems the problem appears
> when there are multiple keyspaces.
>
> 2010/11/8 Thibaut Britz <thibaut.britz@trendiction.com>
>
> Hi,
>>
>> No, I didn't solve the problem. I reinitialized the cluster and manually
>> gave each node a token before adding data. There are a few messages in
>> multiple threads related to this, so I suspect it's very common; I hope
>> it's gone in 0.7.
>>
>> Thibaut
>>
>>
>> On Sun, Nov 7, 2010 at 6:57 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Did you solve this problem? I'm having the same problem. I'm trying to
>>> bootstrap a third node in a 0.6.6 cluster. It has two keyspaces, Keyspace1
>>> and KeyspaceLogs, both with replication factor 2.
>>>
>>> It starts bootstrapping and receives some streams, but then it just keeps
>>> waiting for streams. I enabled debug mode. These lines may be useful:
>>>
>>> DEBUG [main] 2010-11-07 17:39:50,052 BootStrapper.java (line 70)
>>> Beginning bootstrap process
>>> DEBUG [main] 2010-11-07 17:39:50,082 StorageService.java (line 160) Added
>>> /10.204.93.16/Keyspace1 as a bootstrap source
>>> ...
>>> DEBUG [main] 2010-11-07 17:39:50,090 StorageService.java (line 160) Added
>>> /10.204.93.16/KeyspaceLogs as a bootstrap source
>>> ... (streaming messages)
>>> DEBUG [Thread-56] 2010-11-07 17:45:51,706 StorageService.java (line 171)
>>> Removed /10.204.93.16/Keyspace1 as a bootstrap source; remaining is
>>> [/10.204.93.16]
>>> ...
>>> (and never ends).
>>>
>>> It seems it is waiting for [/10.204.93.16] when it should be waiting for
>>> /10.204.93.16/KeyspaceLogs.
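>>>
>>> (As a purely illustrative sketch -- not Cassandra's actual code -- of the
>>> bookkeeping the log suggests is going wrong: pending bootstrap sources
>>> ought to be tracked as (address, keyspace) pairs, and "remaining is
>>> [/10.204.93.16]" hints the keyspace part of the second entry was lost, so
>>> its completion can never match:)
>>>
>>> pending = {("10.204.93.16", "Keyspace1"),
>>>            ("10.204.93.16", "KeyspaceLogs")}
>>>
>>> def stream_completed(address, keyspace):
>>>     # Removing one keyspace's entry must leave the other pair intact.
>>>     pending.discard((address, keyspace))
>>>     return not pending  # bootstrap may finish only once nothing is pending
>>>
>>> stream_completed("10.204.93.16", "Keyspace1")
>>> assert ("10.204.93.16", "KeyspaceLogs") in pending  # still correctly waiting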
>>>
>>> The third node is 64-bit, while the two existing nodes are 32-bit. Could
>>> this be a problem?
>>>
>>> Thank you.
>>>
>>>
>>> 2010/10/28 Dimitry Lvovsky <dimitry@reviewpro.com>
>>>
>>>> Maybe your <StoragePort>7000</StoragePort> is being blocked by iptables
>>>> or some other firewall, or maybe you have it bound (the <ListenAddress>
>>>> tag) to localhost instead of an IP address.
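>>>>
>>>> A quick way to sanity-check that from the new node (Python; the address
>>>> below is hypothetical -- use an existing node's ListenAddress):
>>>>
>>>> import socket
>>>> # Times out or is refused if iptables drops port 7000, or if the existing
>>>> # node is bound to localhost rather than a reachable IP address.
>>>> socket.create_connection(("192.168.2.20", 7000), timeout=5).close()
>>>> print("storage port reachable")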
>>>>
>>>> Hope this helps,
>>>> Dimitry.
>>>>
>>>>
>>>>
>>>> On Thu, Oct 28, 2010 at 5:35 PM, Thibaut Britz <
>>>> thibaut.britz@trendiction.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have the same problem with 0.6.5
>>>>>
>>>>> New nodes will hang forever in bootstrap mode (no streams are being
>>>>> opened) and the receiver thread just waits for data forever:
>>>>>
>>>>>
>>>>>  INFO [Thread-53] 2010-10-27 20:33:37,399 SSTableReader.java (line 120)
>>>>> Sampling index for /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>>  INFO [Thread-53] 2010-10-27 20:33:37,444 StreamCompletionHandler.java
>>>>> (line 64) Streaming added /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>>
>>>>> Stack trace:
>>>>>
>>>>> "pool-1-thread-53" prio=10 tid=0x00000000412f2800 nid=0x215c runnable
>>>>> [0x00007fd7cf217000]
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>>         at
>>>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>         at
>>>>> java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>>>         at
>>>>> java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>         - locked <0x00007fd7e77e0520> (a java.io.BufferedInputStream)
>>>>>         at
>>>>> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
>>>>>         at
>>>>> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>>>>>         at
>>>>> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
>>>>>         at
>>>>> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
>>>>>         at
>>>>> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
>>>>>         at
>>>>> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1154)
>>>>>         at
>>>>> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>
>>>>>
>>>>> On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <
>>>>> aaron@thelastpickle.com> wrote:
>>>>>
>>>>>> The best approach is to manually select the tokens; see the Load
>>>>>> Balancing section: http://wiki.apache.org/cassandra/Operations
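>>>>>> For RandomPartitioner, the calculation sketched there is evenly spaced
>>>>>> tokens over the 0..2**127 range (num_nodes = 4 is just an example):
>>>>>>
>>>>>> num_nodes = 4  # example ring size
>>>>>> for i in range(num_nodes):
>>>>>>     print(i * (2 ** 127) // num_nodes)
>>>>>>
>>>>>> and each printed value goes into <InitialToken> on the matching node.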
>>>>>>
>>>>>> Also, are there any log messages in the existing nodes or the new one
>>>>>> which mention each other?
>>>>>>
>>>>>> Is this a production system? Is it still running?
>>>>>>
>>>>>> Sorry, there is not a lot to go on; it sounds like you've done the
>>>>>> right thing. I'm assuming things like the Cluster Name, seed list and
>>>>>> port numbers are set correctly, as the new node got some data.
>>>>>>
>>>>>> You'll need to dig through the logs a bit more to see whether the
>>>>>> bootstrapping started and what the last message it logged was.
>>>>>>
>>>>>> Good Luck.
>>>>>> Aaron
>>>>>>
>>>>>> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>>>>>>
>>>>>> Hi Aaron,
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> We still haven't solved this unfortunately.
>>>>>>
>>>>>>> How did you start the bootstrap for the .18 node?
>>>>>>
>>>>>>
>>>>>> Standard way: we set "AutoBootstrap" to true and added all the servers
>>>>>> from the working ring as seeds.
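>>>>>>
>>>>>> For reference, a minimal sketch of the relevant storage-conf.xml
>>>>>> fragment on the new node (0.6.x; the seed addresses are hypothetical):
>>>>>>
>>>>>> <AutoBootstrap>true</AutoBootstrap>
>>>>>> <Seeds>
>>>>>>     <Seed>192.168.2.20</Seed>
>>>>>>     <Seed>192.168.2.21</Seed>
>>>>>> </Seeds>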
>>>>>>
>>>>>>
>>>>>>> Was it the .18 or the .17 node you tried to add?
>>>>>>
>>>>>>
>>>>>> We first tried adding .17: it streamed for a while, took on 50GB of
>>>>>> load, and stopped streaming, but then didn't enter the ring.  We left
>>>>>> it for a few days to see if it would come in, but no luck.  After that
>>>>>> we did decommission and removeToken (in that order) operations.
>>>>>> Since we couldn't get .17 in, we tried again with .18.  Before doing
>>>>>> so we increased RpcTimeoutInMillis from 1000 to 10000, having read
>>>>>> that too low a value may cause the problem of nodes not entering the
>>>>>> ring.  It's been going since Friday and still, like .17, won't come
>>>>>> into the ring.
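>>>>>>
>>>>>> That is, in storage-conf.xml:
>>>>>>
>>>>>> <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>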
>>>>>>
>>>>>>> Does it have a token in the config, or did you use nodetool move to
>>>>>>> set it?
>>>>>>
>>>>>> No, we didn't manually set the token in the config; rather, we were
>>>>>> relying on the token being assigned during bootstrap by the
>>>>>> RandomPartitioner.
>>>>>>
>>>>>> Again thanks for the help.
>>>>>>
>>>>>> Dimitry.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <
>>>>>> aaron@thelastpickle.com> wrote:
>>>>>>
>>>>>>> Dimitry, did you get anywhere with this?
>>>>>>>
>>>>>>> Was it the .18 or the .17 node you tried to add? How did you start
>>>>>>> the bootstrap for the .18 node? Does it have a token in the config,
>>>>>>> or did you use nodetool move to set it?
>>>>>>>
>>>>>>> I had a quick look at the code; AFAIK the message about removing the
>>>>>>> fat client is logged when the node does not have a record of the
>>>>>>> token the other node has.
>>>>>>>
>>>>>>> Aaron
>>>>>>>
>>>>>>> On 26 Oct 2010, at 10:42 PM, Dimitry Lvovsky <dimitry@reviewpro.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> We recently upgraded from 0.6.5 to 0.6.6, after which we tried adding
>>>>>>> a new node to our cluster.  We left it bootstrapping, and after 3
>>>>>>> days it still refused to join the ring.  The strange thing is that
>>>>>>> nodetool info shows 50GB of load, and nodetool ring shows that it
>>>>>>> sees the rest of the ring, which it is not part of.  We tried the
>>>>>>> process again with another server -- again the same thing as before:
>>>>>>>
>>>>>>>
>>>>>>> // from machine 192.168.2.18
>>>>>>>
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>>>>>>> 131373516047318302934572185119435768941
>>>>>>> Load : 52.85 GB
>>>>>>> Generation No : 1287761987
>>>>>>> Uptime (seconds) : 323157
>>>>>>> Heap Memory (MB) : 795.42 / 1945.63
>>>>>>>
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>>>>>>> Address       Status  Load      Range                                     Ring
>>>>>>>                                 158573510920250391466717289405976537674
>>>>>>> 192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202   |<--|
>>>>>>> 192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055   |   |
>>>>>>> 192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065  |   |
>>>>>>> 192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674  |-->|
>>>>>>>
>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>>>>>>> Mode: Bootstrapping
>>>>>>> Not sending any streams.
>>>>>>> Not receiving any streams.
>>>>>>>
>>>>>>>
>>>>>>> What's more, while looking at the log of one of the nodes I see
>>>>>>> gossip messages from 192.168.2.17 -- the first node we tried to add
>>>>>>> to the cluster, but which was not running at the time of the log
>>>>>>> message:
>>>>>>>
>>>>>>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406)
>>>>>>> FatClient /192.168.2.17 has been silent for 3600000ms, removing from
>>>>>>> gossip
>>>>>>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node
>>>>>>> /192.168.2.17 is now part of the cluster
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance for the help,
>>>>>>> Dimitry
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dimitry Lvovsky
>>>>>> Director of Engineering
>>>>>> ReviewPro
>>>>>> www.reviewpro.com
>>>>>> +34 616 337 103
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dimitry Lvovsky
>>>> Director of Engineering
>>>> ReviewPro
>>>> www.reviewpro.com
>>>> +34 616 337 103
>>>>
>>>
>>>
>>
>


-- 
Dimitry Lvovsky
Director of Engineering
ReviewPro
www.reviewpro.com
+34 616 337 103
