cassandra-user mailing list archives

From Georg Brandemann <georg.brandem...@gmail.com>
Subject Re: AWS instance stop and start with EBS
Date Fri, 29 Nov 2019 11:36:59 GMT
Hi Rahul

Also have a look at https://issues.apache.org/jira/browse/CASSANDRA-14358 .
We saw this on a 2.1.x cluster, and there it also took ~10 minutes until the
restarted node was really fully available in the cluster. The echo ACKs
from some nodes simply seemed never to reach the target.

Georg

On Wed, Nov 6, 2019 at 9:41 PM Rahul Reddy <
rahulreddy1234@gmail.com>:

> Thanks Daemeon,
>
> Will do that and post the results.
> I found a JIRA in open state with a similar issue:
> https://issues.apache.org/jira/browse/CASSANDRA-13984
>
> On Wed, Nov 6, 2019 at 1:49 PM daemeon reiydelle <daemeonr@gmail.com>
> wrote:
>
>> No connection timeouts? No TCP-level retries? I am truly sorry, but
>> you have exceeded my capability. I have never seen a java.io timeout
>> without either a half-open session failure (no response) or multiple
>> retries.
>>
>> I am out of my depth, so please feel free to ignore, but did you see the
>> packets that are making the initial connection (which must have timed out)?
>> Out of curiosity, netstat -arn should be showing bad packets, timeouts,
>> etc. To see progress, create a simple shell script that dumps the date, dumps
>> netstat, sleeps 100 seconds, and repeats. During that window stop the remote
>> node, wait 10 seconds, and restart it.
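>> Such a loop might look like the sketch below (assuming a POSIX shell and
>> net-tools' netstat; the sample count, interval, and grep pattern are all
>> illustrative, not prescriptive):

```shell
#!/bin/sh
# Dump a timestamp plus netstat error counters every INTERVAL seconds,
# COUNT times. Sketch only: adjust the grep pattern to the counters you care about.
monitor() {
  count="${1:-6}"       # number of samples to take
  interval="${2:-100}"  # seconds between samples
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "=== $(date) ==="
    # -s prints per-protocol statistics (retransmits, timeouts, resets)
    netstat -s 2>/dev/null | grep -iE 'retrans|timeout|reset' || echo "(no netstat stats)"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}

monitor 2 0   # two quick samples for demonstration
```

>> Run it in one terminal while bouncing the node in another, then diff the
>> snapshots taken before and after the restart.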
>>
>> <======>
>> Made weak by time and fate, but strong in will,
>> To strive, to seek, to find, and not to yield.
>> Ulysses - A. Lord Tennyson
>>
>> *Daemeon C.M. Reiydelle*
>>
>> *email: daemeonr@gmail.com <daemeonr@gmail.com>*
>> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>>
>>
>>
>> On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy <rahulreddy1234@gmail.com>
>> wrote:
>>
>>> Thank you.
>>>
>>> I have stopped the instance in east. I see that all other instances can
>>> gossip to that instance, and only one instance in west is having issues
>>> gossiping to that node. When I enable debug mode I see the below on the west
>>> node.
>>>
>>> I see the below messages from 16:32 to 16:47:
>>>
>>> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>>> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
>>>
>>> Later I see a timeout:
>>>
>>> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831
>>> OutboundTcpConnection.java:350 - Error writing to /eastip
>>> java.io.IOException: Connection timed out
>>>
>>> Then:
>>>
>>> INFO  [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL
>>>
>>> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>>>
>>> I tried running some tcpdump during that time and I don't see any packet loss.
>>> I am still unsure why the east instance, which was stopped and started, was
>>> unreachable from the west node for almost 15 minutes.
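>>> For the capture, something along these lines can confirm whether gossip
>>> connection attempts are actually leaving the box (a sketch: the interface
>>> name eth0 and the default storage_port 7000 are assumptions to adjust per
>>> host; the helper only prints the command so it can be reviewed first):

```shell
#!/bin/sh
# Build a tcpdump command that captures only gossip connection setup/teardown
# (SYN/RST/FIN), so the pcap stays small. Interface and port are hypothetical.
build_capture_cmd() {
  iface="${1:-eth0}"
  port="${2:-7000}"   # Cassandra default storage_port
  echo "tcpdump -i $iface -w gossip.pcap 'tcp port $port and (tcp[tcpflags] & (tcp-syn|tcp-rst|tcp-fin) != 0)'"
}

build_capture_cmd eth0 7000
```

>>> Comparing SYNs seen on the west node against SYN-ACKs seen on the east
>>> node would show on which side the handshake is being lost.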
>>>
>>>
>>> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daemeonr@gmail.com>
>>> wrote:
>>>
>>>> 10 minutes is 600 seconds, and there are several timeouts that are set
>>>> to that value, including the data center timeout as I recall.
>>>>
>>>> You may be forced to tcpdump the interface(s) to see where the chatter
>>>> is. Out of curiosity, when you restart the node, have you snapped the jvm's
>>>> memory to see if e.g. heap is even in use?
>>>>
>>>>
>>>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1234@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Ben,
>>>>> Before stopping the EC2 instance I ran nodetool drain, so I ruled that out,
>>>>> and system.log also doesn't show commit logs being replayed.
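>>>>> The sequence, as a dry-run sketch (the systemd unit name "cassandra" is
>>>>> an assumption; run() only echoes each step so the order is visible, and
>>>>> would be replaced with real execution):

```shell
#!/bin/sh
# Dry-run sketch of the restart sequence around an EC2 stop/start.
# Each step is echoed rather than executed; swap run() for real commands.
run() { echo "+ $*"; }

restart_node() {
  run nodetool drain                 # flush memtables, stop accepting writes
  run sudo systemctl stop cassandra  # unit name is an assumption
  # ... EC2 stop/start of the instance happens here ...
  run sudo systemctl start cassandra
  run nodetool status                # verify the node rejoins as UN
}

restart_node
```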
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.slater@instaclustr.com>
>>>>> wrote:
>>>>>
>>>>>> The logs between first start and handshaking should give you a
>>>>>> clue, but my first guess would be replaying commit logs.
>>>>>>
>>>>>> Cheers
>>>>>> Ben
>>>>>>
>>>>>> ---
>>>>>>
>>>>>>
>>>>>> *Ben Slater**Chief Product Officer*
>>>>>>
>>>>>>
>>>>>> Read our latest technical blog posts here
>>>>>> <https://www.instaclustr.com/blog/>.
>>>>>>
>>>>>> This email has been sent on behalf of Instaclustr Pty. Limited
>>>>>> (Australia) and Instaclustr Inc (USA).
>>>>>>
>>>>>> This email and any attachments may contain confidential and legally
>>>>>> privileged information. If you are not the intended recipient, do not copy
>>>>>> or disclose its content, but please reply to this email immediately and
>>>>>> highlight the error to the sender and then immediately delete the message.
>>>>>>
>>>>>>
>>>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1234@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I can reproduce the issue.
>>>>>>>
>>>>>>> I did drain the Cassandra node, then stopped and started the instance.
>>>>>>> The Cassandra instance comes up, but the other nodes stay in DN state for
>>>>>>> around 10 minutes.
>>>>>>>
>>>>>>> I don't see errors in the system.log
>>>>>>>
>>>>>>> DN  xx.xx.xx.59   420.85 MiB  256  48.2%  id  2
>>>>>>> UN  xx.xx.xx.30   432.14 MiB  256  50.0%  id  0
>>>>>>> UN  xx.xx.xx.79   447.33 MiB  256  51.1%  id  4
>>>>>>> DN  xx.xx.xx.144  452.59 MiB  256  51.6%  id  1
>>>>>>> DN  xx.xx.xx.19   431.7 MiB   256  50.1%  id  5
>>>>>>> UN  xx.xx.xx.6    421.79 MiB  256  48.9%
>>>>>>>
>>>>>>> When I do nodetool status, 3 nodes are still showing down, and I don't
>>>>>>> see errors in system.log.
>>>>>>>
>>>>>>> After 10 minutes it shows the other node as up as well.
>>>>>>>
>>>>>>>
>>>>>>> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstartednode
>>>>>>> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichwasshowingdown is now UP
>>>>>>>
>>>>>>> What is causing the 10-minute delay before it can say that the node is
>>>>>>> reachable?
>>>>>>>
>>>>>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1234@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Also, an AWS EC2 stop and start comes up as a new instance with the
>>>>>>>> same IP, and all our file systems are on EBS and mounted fine. Does the
>>>>>>>> new instance coming up with the same IP cause any gossip issues?
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1234@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM,
>>>>>>>>> and we stopped and started only one instance at a time. Though
>>>>>>>>> nodetool status says all nodes are UN and system.log says Cassandra
>>>>>>>>> started and began listening, the JMX exporter shows the instance stayed
>>>>>>>>> down longer. How do we determine what caused Cassandra to be unavailable
>>>>>>>>> even though the log says it started and is listening?
>>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
>>>>>>>>> oleksandr.shulgin@zalando.de> wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <
>>>>>>>>>> rahulreddy1234@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We have our infrastructure on AWS and we use EBS storage, and
>>>>>>>>>>> AWS was retiring one of the nodes. Since our storage was persistent,
>>>>>>>>>>> we did nodetool drain and stopped and started the instance. This
>>>>>>>>>>> caused 500 errors in the service. We have LOCAL_QUORUM and RF=3;
>>>>>>>>>>> why does stopping one instance cause the application to have issues?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can you still look up what the underlying error from the
>>>>>>>>>> Cassandra driver was in the application logs? Was it a request timeout
>>>>>>>>>> or not enough replicas?
>>>>>>>>>>
>>>>>>>>>> For example, if you only had 3 Cassandra nodes, restarting one of
>>>>>>>>>> them temporarily reduces your cluster capacity by 33%.
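>>>>>>>>>> As a quick sanity check on the quorum arithmetic (a shell sketch;
>>>>>>>>>> the helper name is made up for illustration):

```shell
#!/bin/sh
# LOCAL_QUORUM needs floor(RF/2) + 1 replicas to respond.
quorum_for() {
  rf="$1"
  echo $(( rf / 2 + 1 ))
}

echo "RF=3 -> quorum $(quorum_for 3)"   # with one replica down, no spare remains
```

>>>>>>>>>> So with RF=3 a quorum needs 2 replicas: one node down still works,
>>>>>>>>>> but any hiccup on a second replica immediately fails the request.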
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> --
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>
