cassandra-user mailing list archives

From Jake Luciani <jak...@gmail.com>
Subject Re: Bootstrapping taking long
Date Tue, 04 Jan 2011 14:08:25 GMT
In 0.6, locate the node doing the anti-compaction and look in the "streams"
subdirectory of the keyspace data dir to monitor the anti-compaction
progress (it puts the new SSTables for the bootstrapping node in there).
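
If watching the directory by hand gets tedious, something like the sketch
below could poll it for you. It is only an illustration: stream_files and
watch are made-up helper names, and it assumes the files land directly in
the "streams" directory and that a growing total size means anti-compaction
is still making progress.

```python
import os
import time


def stream_files(streams_dir):
    """Return {filename: size_in_bytes} for regular files directly in streams_dir."""
    sizes = {}
    for entry in os.scandir(streams_dir):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def watch(streams_dir, interval_s=30, rounds=10):
    """Print the total size of streams_dir every interval_s seconds, rounds times."""
    previous = None
    for _ in range(rounds):
        total = sum(stream_files(streams_dir).values())
        if previous is None:
            delta = ""
        else:
            delta = f" ({total - previous:+d} bytes since last check)"
        print(f"{time.strftime('%H:%M:%S')} total={total} bytes{delta}")
        previous = total
        time.sleep(interval_s)
```

You would call watch() with the path of the "streams" directory on the node
that is anti-compacting, e.g. <keyspace data dir>/streams. If the total stops
changing for several rounds, anti-compaction has probably finished or stalled.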

On Tue, Jan 4, 2011 at 8:01 AM, Ran Tavory <rantav@gmail.com> wrote:

> Running nodetool decommission didn't help. Actually, the node refused to
> decommission itself (b/c it wasn't part of the ring). So I simply stopped
> the process, deleted all the data directories and started it again. It
> worked in the sense that the node bootstrapped again, but as before, after
> it had finished moving the data nothing happened for a long time (I'm still
> waiting, but nothing seems to be happening).
>
> Any hints how to analyze a "stuck" bootstrapping node??
> thanks
>
> On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory <rantav@gmail.com> wrote:
>
>> Thanks Shimi. Indeed, anticompaction was run on one of the other nodes
>> from the same DC, but to my understanding it has already ended, a few hours
>> ago.
>> I saw plenty of log messages such as [1], which ended a couple of hours ago,
>> and I've seen the new node streaming and accepting the data from the node
>> which performed the anticompaction. So far it was normal, so it seemed
>> that the data is in its right place. But now the new node seems sort of stuck.
>> None of the other nodes is anticompacting right now or has been
>> anticompacting since then.
>> The new node's CPU is close to zero and its iostats are almost zero, so I
>> can't find another bottleneck that would keep it hanging.
>>
>> On IRC someone suggested I retry joining this node,
>> e.g. decommission it and have it rejoin. I'll try it now...
>>
>>
>> [1]
>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java
>> (line 338) AntiCompacting
>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:18,683 CompactionManager.java
>> (line 338) AntiCompacting
>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3874-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3873-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3876-Data.db')]
>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:19,132 CompactionManager.java
>> (line 338) AntiCompacting
>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-951-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-976-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-978-Data.db')]
>>  INFO [COMPACTION-POOL:1] 2011-01-04 04:34:26,486 CompactionManager.java
>> (line 338) AntiCompacting
>> [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>
>> On Tue, Jan 4, 2011 at 12:45 PM, shimi <shimi.k@gmail.com> wrote:
>>
>>> In my experience, most of the time it takes for a node to join the cluster
>>> is spent on anticompaction on the other nodes. The streaming part is very
>>> fast. Check the other nodes' logs to see if any node is doing
>>> anticompaction.
>>> I don't remember how much data I had in the cluster when I needed to
>>> add/remove nodes. I do remember that it took a few hours.
>>>
>>> The node will join the ring only when it finishes the bootstrap.
>>>
>>> Shimi
>>>
>>>
>>> On Tue, Jan 4, 2011 at 12:28 PM, Ran Tavory <rantav@gmail.com> wrote:
>>>
>>>> I asked the same question on the IRC but no luck there, everyone's
>>>> asleep ;)...
>>>>
>>>> Using 0.6.6 I'm adding a new node to the cluster.
>>>> It starts out fine but then gets stuck on the bootstrapping state for
>>>> too long. More than an hour and still counting.
>>>>
>>>> $ bin/nodetool -p 9004 -h localhost streams
>>>>> Mode: Bootstrapping
>>>>> Not sending any streams.
>>>>> Not receiving any streams.
>>>>
>>>>
>>>> It seemed to have streamed data from other nodes and indeed the load is
>>>> non-zero but I'm not clear what's keeping it right now from finishing.
>>>>
>>>>> $ bin/nodetool -p 9004 -h localhost info
>>>>> 51042355038140769519506191114765231716
>>>>> Load             : 22.49 GB
>>>>> Generation No    : 1294133781
>>>>> Uptime (seconds) : 1795
>>>>> Heap Memory (MB) : 315.31 / 6117.00
>>>>
>>>>
>>>> nodetool ring does not list this new node in the ring. nodetool can
>>>> happily talk to the new node; it's just not listing itself as a member
>>>> of the ring. This is expected while the node is still bootstrapping, so the
>>>> question is still how long the bootstrap might take and whether it is stuck.
>>>>
>>>> The data isn't huge, so I find it hard to believe that streaming or
>>>> anticompaction are the bottlenecks. I have ~20G on each node and the new
>>>> node already has just about that, so it seems that all the data has already
>>>> been streamed to it successfully, or at least most of it... So what is it
>>>> waiting for now? (same question, rephrased... ;)
>>>>
>>>> I tried:
>>>> 1. Restarting the new node. No good. All logs seem normal but at the end
>>>> the node is still in bootstrap mode.
>>>> 2. As someone suggested I increased the rpc timeout from 10k to 30k
>>>> (RpcTimeoutInMillis) but that didn't seem to help. I did this only on the
>>>> new node. Should I have done that on all (old) nodes as well? Or maybe only
>>>> on the ones that were supposed to stream data to that node.
>>>> 3. Logging level at DEBUG now but nothing interesting going on except
>>>> for occasional messages such as [1] or [2]
>>>>
>>>> So the question is: what's keeping the new node from finishing the
>>>> bootstrap and how can I check its status?
>>>> Thanks
>>>>
>>>> [1] DEBUG [Timer-1] 2011-01-04 05:21:24,402 LoadDisseminator.java (line
>>>> 36) Disseminating load info ...
>>>> [2] DEBUG [RMI TCP Connection(22)-192.168.252.88] 2011-01-04
>>>> 05:12:48,033 StorageService.java (line 1189) computing ranges for
>>>> 28356863910078205288614550619314017621,
>>>> 56713727820156410577229101238628035242,
>>>> 85070591730234615865843651857942052863,
>>>> 113427455640312821154458202477256070484,
>>>> 141784319550391026443072753096570088105,
>>>> 170141183460469231731687303715884105727
>>>>
>>>> --
>>>> /Ran
>>>>
>>>>
>>>
>>
>>
>> --
>> /Ran
>>
>>
>
>
> --
> /Ran
>
>
