cassandra-user mailing list archives

From laxmikanth sadula <laxmikanth...@gmail.com>
Subject Re: New node block in autobootstrap
Date Tue, 27 Sep 2016 17:07:03 GMT
@Paulo Motta

We are also seeing streaming timeout exceptions during 'nodetool rebuild'.
I set streaming_socket_timeout_in_ms to 86400000 (24 hours), as suggested in
the DataStax support article
https://support.datastax.com/hc/en-us/articles/206502913-FAQ-How-to-reduce-the-impact-of-streaming-errors-or-failures
but we are still getting streaming exceptions.
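
One thing worth ruling out in a case like this is that the setting is still commented out in the file the node actually loads. A minimal sketch, assuming the value appears in cassandra.yaml as a plain `key: value` line; the helper name and the sample text are hypothetical:

```python
import re

# Matches only an active (uncommented) line for the setting.
_PATTERN = re.compile(
    r"^[ \t]*streaming_socket_timeout_in_ms[ \t]*:[ \t]*(\d+)[ \t]*(?:#.*)?$",
    re.MULTILINE,
)

def streaming_timeout_hours(yaml_text):
    """Return the configured timeout in hours, or None if it is not set."""
    match = _PATTERN.search(yaml_text)
    if match is None:
        return None
    return int(match.group(1)) / 3_600_000  # milliseconds per hour

# Hypothetical file content: one commented-out line, one active line.
sample = (
    "# streaming_socket_timeout_in_ms: 3600000\n"
    "streaming_socket_timeout_in_ms: 86400000\n"
)
print(streaming_timeout_hours(sample))  # 24.0
```

In practice the text would come from the node's own cassandra.yaml, whose location depends on the installation.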

Also, what are the recommended kernel tcp_keepalive settings/values that
would help streaming succeed?
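
For background on this question: on Linux, an idle connection whose peer has gone away is declared dead after roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds. A minimal sketch that reads the three values and computes that window. The /proc paths are the standard Linux locations, the function names are hypothetical, and the 60/10/3 figures are only an illustration of a tighter setting, not a recommendation taken from this thread:

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")  # standard Linux location for these sysctls

def read_keepalive(proc_dir=PROC_DIR):
    """Read the kernel keepalive settings (Linux only) as integers."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((proc_dir / name).read_text().strip()) for name in names}

def detection_window_seconds(time_s, intvl_s, probes):
    """Idle time before the first probe plus the time spent on unanswered probes."""
    return time_s + intvl_s * probes

print(detection_window_seconds(7200, 75, 9))  # Linux defaults: 7875
print(detection_window_seconds(60, 10, 3))    # illustrative tighter values: 90
```

On a Linux host, `read_keepalive()` returns the current values, which can then be passed to `detection_window_seconds` to see how long a dead streaming connection could go unnoticed.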

Thank you

On Tue, Aug 16, 2016 at 12:21 AM, Paulo Motta <pauloricardomg@gmail.com>
wrote:

> What version are you on? This seems like a typical case where there was a
> problem with streaming (hanging, etc.). Do you have access to the logs?
> Maybe look for streaming errors? Typically streaming errors are related to
> timeouts, so you should review your cassandra
> streaming_socket_timeout_in_ms and kernel tcp_keepalive settings.
>
> If you're on 2.2+ you can resume a failed bootstrap with nodetool
> bootstrap resume. There were also some streaming hanging problems fixed
> recently, so I'd advise you to upgrade to the latest version of your
> particular series for a more robust version.
>
> Is there any reason why you didn't use the replace procedure
> (-Dreplace_address) to replace the node with the same tokens? This would be
> a bit faster than the remove + bootstrap procedure.
>
> 2016-08-15 15:37 GMT-03:00 Jérôme Mainaud <jerome@mainaud.com>:
>
>> Hello,
>>
>> A client of mine has problems adding a node to the cluster.
>> After 4 days, the node is still in joining mode, it doesn't have the same
>> level of load as the others, and there seems to be no streaming from or to
>> the new node.
>>
>> This node has a history.
>>
>>    1. At the beginning, it was a seed in the cluster.
>>    2. Ops detected that client had problems with it.
>>    3. They tried to reset it but failed. In the process they launched
>>    several repair and rebuild processes on the node.
>>    4. Then they asked me to help them.
>>    5. We stopped the node,
>>    6. removed it from the list of seeds (more precisely it was replaced
>>    by another node),
>>    7. removed it from the cluster (I chose not to use decommission
>>    since node data was compromised)
>>    8. deleted all files from data, commitlog and savedcache directories.
>>    9. after the leaving process ended, it was started as a fresh new
>>    node and began autobootstrap.
>>
>>
>> As I don’t have direct access to the cluster, I don't have a lot of
>> information, but I will have more tomorrow (logs and results of some
>> commands). And I can ask people for any required information.
>>
>> Does someone have any idea of what could have happened and what I should
>> investigate first?
>> What would you do to unblock the situation?
>>
>> Context: The cluster consists of two DC, each with 15 nodes. Average load
>> is around 3 TB per node. The joining node froze a little after 2 TB.
>>
>> Thank you for your help.
>> Cheers,
>>
>>
>> --
>> Jérôme Mainaud
>> jerome@mainaud.com
>>
>
>


-- 
Regards,
Laxmikanth
99621 38051
