incubator-cassandra-user mailing list archives

From Thibaut Britz <thibaut.br...@trendiction.com>
Subject Re: New nodes won't bootstrap on .66
Date Mon, 08 Nov 2010 20:37:53 GMT
I also had multiple keyspaces defined (> 20). All nodes were 64-bit, with no
32-bit nodes mixed in.


On Mon, Nov 8, 2010 at 8:23 PM, Dimitry Lvovsky <dimitry@reviewpro.com> wrote:

> We didn't solve it, unfortunately, and ended up regenerating the entire
> cluster. But, if it helps anyone in the future, we too had multiple
> keyspaces when we encountered the problem.
>
>
>
> On Mon, Nov 8, 2010 at 5:47 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:
>
>> I have just solved the problem by removing the second keyspace (manually
>> moving its column families to the first). So it seems the problem appears
>> when there are multiple keyspaces.
>>
>> 2010/11/8 Thibaut Britz <thibaut.britz@trendiction.com>
>>
>> Hi,
>>>
>>> No, I didn't solve the problem. I reinitialized the cluster and manually
>>> gave each node a token before adding data. There are a few messages in
>>> multiple threads related to this, so I suspect it's very common and I hope
>>> it's gone with 0.7.
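
Manually assigning tokens, as described above, usually means spacing them evenly around the RandomPartitioner ring. A minimal sketch, assuming the token space runs from 0 to 2**127; `balanced_tokens` is a hypothetical helper name, not part of Cassandra or nodetool:

```python
# Assumption: RandomPartitioner tokens live in the range 0 .. 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return one evenly spaced token per node for a ring of node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Node i gets the token at fraction i / node_count of the ring.
    return [i * RING_SIZE // node_count for i in range(node_count)]


# Print the tokens for a hypothetical five-node ring.
for index, token in enumerate(balanced_tokens(5)):
    print(f"node {index}: {token}")
```

Each value would then be set as the token for one node before it first joins the ring. When the node count changes, the tokens need to be recomputed and the existing nodes moved to keep the ring balanced.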
>>>
>>> Thibaut
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Nov 7, 2010 at 6:57 PM, Marc Canaleta <mcanaleta@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Did you solve this problem? I'm having the same problem. I'm trying to
>>>> bootstrap a third node in a 0.6.6 cluster. It has two keyspaces: Keyspace1
>>>> and KeyspaceLogs, both with replication factor 2.
>>>>
>>>> It starts bootstrapping and receives some streams, but then it keeps
>>>> waiting for more streams. I enabled debug mode. These lines may be useful:
>>>>
>>>> DEBUG [main] 2010-11-07 17:39:50,052 BootStrapper.java (line 70)
>>>> Beginning bootstrap process
>>>> DEBUG [main] 2010-11-07 17:39:50,082 StorageService.java (line 160)
>>>> Added /10.204.93.16/Keyspace1 as a bootstrap source
>>>> ...
>>>> DEBUG [main] 2010-11-07 17:39:50,090 StorageService.java (line 160)
>>>> Added /10.204.93.16/KeyspaceLogs as a bootstrap source
>>>> ... (streaming messages)
>>>> DEBUG [Thread-56] 2010-11-07 17:45:51,706 StorageService.java (line 171)
>>>> Removed /10.204.93.16/Keyspace1 as a bootstrap source; remaining is [/10.204.93.16]
>>>> ...
>>>> (and never ends).
>>>>
>>>> It seems it is waiting for [/10.204.93.16] when it should be waiting
>>>> for /10.204.93.16/KeyspaceLogs.
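
The ambiguity in that "remaining" entry can be illustrated with a small sketch. This is not Cassandra's code: both tracker classes and the `replay` helper are hypothetical, and the only inputs taken from the thread are the host and keyspace names in the debug log. It shows why pending sources recorded by host alone cannot say which keyspace is still outstanding, while (host, keyspace) pairs can:

```python
class HostOnlyTracker:
    """Hypothetical tracker that records pending bootstrap sources by host only."""

    def __init__(self):
        self._pending = []

    def add(self, host, keyspace):
        # The keyspace is dropped, so two streams from one host look identical.
        self._pending.append(host)

    def remove(self, host, keyspace):
        self._pending.remove(host)

    def remaining(self):
        return list(self._pending)


class PairTracker:
    """Hypothetical tracker that records pending sources as (host, keyspace)."""

    def __init__(self):
        self._pending = set()

    def add(self, host, keyspace):
        self._pending.add((host, keyspace))

    def remove(self, host, keyspace):
        self._pending.discard((host, keyspace))

    def remaining(self):
        return sorted(self._pending)

    def done(self):
        return not self._pending


def replay(tracker):
    """Replay the add/remove events visible in the debug log."""
    tracker.add("10.204.93.16", "Keyspace1")
    tracker.add("10.204.93.16", "KeyspaceLogs")
    tracker.remove("10.204.93.16", "Keyspace1")
    return tracker.remaining()
```

Replaying the log through `HostOnlyTracker` leaves a bare host entry, much like the one in the log, whereas `PairTracker` leaves the KeyspaceLogs pair, which makes the outstanding stream explicit.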
>>>>
>>>> The third node is 64-bit, while the two existing nodes are 32-bit. Could
>>>> this be a problem?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> 2010/10/28 Dimitry Lvovsky <dimitry@reviewpro.com>
>>>>
>>>>> Maybe your <StoragePort>7000</StoragePort> is being blocked by
>>>>> iptables or some firewall, or maybe you have it bound (<ListenAddress> tag)
>>>>> to localhost instead of an IP address.
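
One way to check the firewall or bind-address theory is to test, from another node, whether the storage port accepts TCP connections. A minimal sketch; `can_connect` is a hypothetical helper, and port 7000 is simply the <StoragePort> value mentioned above:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("192.168.2.22", 7000)` run from the new node would show whether an existing node's storage port is reachable. A False result only says that the connection failed within the timeout. It does not distinguish a firewall rule from a listen address bound to localhost.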
>>>>>
>>>>> Hope this helps,
>>>>> Dimitry.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 28, 2010 at 5:35 PM, Thibaut Britz <
>>>>> thibaut.britz@trendiction.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have the same problem with 0.6.5
>>>>>>
>>>>>> New nodes will hang forever in bootstrap mode (no streams are being
>>>>>> opened) and the receiver thread just waits for data forever:
>>>>>>
>>>>>>
>>>>>>  INFO [Thread-53] 2010-10-27 20:33:37,399 SSTableReader.java (line 120) Sampling index for /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>>>  INFO [Thread-53] 2010-10-27 20:33:37,444 StreamCompletionHandler.java (line 64) Streaming added /hd2/cassandra/data/table_xyz/table_xyz-3-Data.db
>>>>>>
>>>>>> Stack trace:
>>>>>>
>>>>>> "pool-1-thread-53" prio=10 tid=0x00000000412f2800 nid=0x215c runnable [0x00007fd7cf217000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>>         - locked <0x00007fd7e77e0520> (a java.io.BufferedInputStream)
>>>>>>         at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:126)
>>>>>>         at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
>>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
>>>>>>         at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
>>>>>>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1154)
>>>>>>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>
>>>>>> On Thu, Oct 28, 2010 at 12:44 PM, aaron morton <
>>>>>> aaron@thelastpickle.com> wrote:
>>>>>>
>>>>>>> The best approach is to manually select the tokens; see the Load
>>>>>>> Balancing section of http://wiki.apache.org/cassandra/Operations
>>>>>>>
>>>>>>> Also, are there any log messages on the existing nodes or the new one
>>>>>>> that mention each other?
>>>>>>>
>>>>>>> Is this a production system? Is it still running?
>>>>>>>
>>>>>>> Sorry there is not a lot to go on; it sounds like you've done the
>>>>>>> right thing. I'm assuming things like the cluster name, seed list, and
>>>>>>> port numbers are set correctly, as the new node got some data.
>>>>>>>
>>>>>>> You'll need to dig through the logs a bit more to see that the
>>>>>>> bootstrapping started and what the last message it logged was.
>>>>>>>
>>>>>>> Good Luck.
>>>>>>> Aaron
>>>>>>>
>>>>>>> On 27 Oct 2010, at 22:40, Dimitry Lvovsky wrote:
>>>>>>>
>>>>>>> Hi Aaron,
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> We still haven't solved this unfortunately.
>>>>>>>
>>>>>>>> How did you start the bootstrap for the .18 node?
>>>>>>>
>>>>>>>
>>>>>>> Standard way: we set "AutoBootstrap" to true and added all the
>>>>>>> servers from the working ring as seeds.
>>>>>>>
>>>>>>>
>>>>>>>> Was it the .18 or the .17 node you tried to add?
>>>>>>>
>>>>>>>
>>>>>>> We first tried adding .17. It streamed for a while, took on 50 GB of
>>>>>>> load, and stopped streaming, but then didn't enter the ring. We left it
>>>>>>> for a few days to see if it would come in, but no luck. After that we
>>>>>>> did decommission and removeToken operations (in that order).
>>>>>>> Since we couldn't get .17 in, we tried again with .18. Before doing
>>>>>>> so we increased RpcTimeoutInMillis from 1000 to 10000, having read that
>>>>>>> the timeout may be what keeps nodes from entering the ring. It's been
>>>>>>> going since Friday and still, like .17, won't come into the ring.
>>>>>>>
>>>>>>>> Does it have a token in the config or did you use nodetool move to
>>>>>>>> set it
>>>>>>>
>>>>>>> No, we didn't manually set the token in the config; rather, we were
>>>>>>> relying on the token being assigned during bootstrap by the
>>>>>>> RandomPartitioner.
>>>>>>>
>>>>>>> Again thanks for the help.
>>>>>>>
>>>>>>> Dimitry.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 10:14 PM, Aaron Morton <
>>>>>>> aaron@thelastpickle.com> wrote:
>>>>>>>
>>>>>>>> Dimitry, did you get anywhere with this?
>>>>>>>>
>>>>>>>> Was it the .18 or the .17 node you tried to add? How did you start
>>>>>>>> the bootstrap for the .18 node? Does it have a token in the config, or
>>>>>>>> did you use nodetool move to set it?
>>>>>>>>
>>>>>>>> I had a quick look at the code. AFAIK the message about removing the
>>>>>>>> fat client is logged when the node does not have a record of the token
>>>>>>>> the other node has.
>>>>>>>>
>>>>>>>> Aaron
>>>>>>>>
>>>>>>>> On 26 Oct 2010, at 10:42 PM, Dimitry Lvovsky <dimitry@reviewpro.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> We recently upgraded from 0.6.5 to 0.6.6, after which we tried adding a
>>>>>>>> new node to our cluster. We left it bootstrapping and after 3 days it
>>>>>>>> still refused to join the ring. The strange thing is that nodetool info
>>>>>>>> shows 50 GB of load and nodetool ring shows that it sees the rest of the
>>>>>>>> ring, which it is not part of. We tried the process again with another
>>>>>>>> server, with the same result as before:
>>>>>>>>
>>>>>>>>
>>>>>>>> //from machine 192.168.2.18
>>>>>>>>
>>>>>>>>
>>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 info
>>>>>>>> 131373516047318302934572185119435768941
>>>>>>>> Load : 52.85 GB
>>>>>>>> Generation No : 1287761987
>>>>>>>> Uptime (seconds) : 323157
>>>>>>>> Heap Memory (MB) : 795.42 / 1945.63
>>>>>>>>
>>>>>>>>
>>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 ring
>>>>>>>> Address       Status  Load      Range                                    Ring
>>>>>>>>                                 158573510920250391466717289405976537674
>>>>>>>> 192.168.2.22  Up      59.45 GB  28203205416427384773583427414698832202   |<--|
>>>>>>>> 192.168.2.23  Up      44.95 GB  60562227403709245514637766500430120055   |   |
>>>>>>>> 192.168.2.20  Up      47.15 GB  104160057322065544623939416372654814065  |   |
>>>>>>>> 192.168.2.21  Up      61.04 GB  158573510920250391466717289405976537674  |-->|
>>>>>>>>
>>>>>>>> /opt/cassandra/bin/nodetool -h localhost -p 8999 streams
>>>>>>>> Mode: Bootstrapping
>>>>>>>> Not sending any streams.
>>>>>>>> Not receiving any streams.
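
The tokens in the ring output can be turned into ownership shares to see how the ring is split and where the bootstrapping node's token falls. A minimal sketch, assuming RandomPartitioner's token space is 0 to 2**127 and that each node owns the range ending at its own token; `ownership` and `owner_of` are hypothetical helper names, and the tokens are copied from the nodetool output:

```python
from bisect import bisect_left
from fractions import Fraction

# Assumption: RandomPartitioner tokens live in the range 0 .. 2**127.
RING_SIZE = 2 ** 127

# Token -> address, copied from the `nodetool ring` output.
RING = {
    28203205416427384773583427414698832202: "192.168.2.22",
    60562227403709245514637766500430120055: "192.168.2.23",
    104160057322065544623939416372654814065: "192.168.2.20",
    158573510920250391466717289405976537674: "192.168.2.21",
}


def ownership(ring):
    """Return each node's share of the ring as an exact Fraction."""
    tokens = sorted(ring)
    shares = {}
    for i, token in enumerate(tokens):
        previous = tokens[i - 1]  # index -1 wraps around to the last token
        # A single-node ring has width 0 after the modulo; it owns everything.
        width = (token - previous) % RING_SIZE or RING_SIZE
        shares[ring[token]] = Fraction(width, RING_SIZE)
    return shares


def owner_of(ring, token):
    """Return the node whose range (previous token, own token] holds token."""
    tokens = sorted(ring)
    index = bisect_left(tokens, token) % len(tokens)
    return ring[tokens[index]]


for address, share in ownership(RING).items():
    print(f"{address}: {float(share):.1%}")
```

Under those assumptions, the token reported by nodetool info on the new node (131373516047318302934572185119435768941) falls inside the range owned by 192.168.2.21, which is also the node with the largest share.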
>>>>>>>>
>>>>>>>>
>>>>>>>> What's more, while looking at the log of one of the nodes I see
>>>>>>>> gossip messages from 192.168.2.17, the first node we tried to add to the
>>>>>>>> cluster, which was not running at the time of the log messages:
>>>>>>>> INFO [Timer-0] 2010-10-26 02:13:20,340 Gossiper.java (line 406) FatClient /192.168.2.17 has been silent for 3600000ms, removing from gossip
>>>>>>>> INFO [GMFD:1] 2010-10-26 02:13:51,398 Gossiper.java (line 591) Node /192.168.2.17 is now part of the cluster
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks in advance for the help,
>>>>>>>> Dimitry
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dimitry Lvovsky
>>>>>>> Director of Engineering
>>>>>>> ReviewPro
>>>>>>> www.reviewpro.com
>>>>>>> +34 616 337 103
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
>
