cassandra-user mailing list archives

From George Webster <webste...@gmail.com>
Subject Re: question of keyspace that just disappeared
Date Fri, 03 Mar 2017 18:08:53 GMT
I think it does on drop keyspace. We had a recent enough snapshot so it
wasn't a big deal to recover. However, we didn't have a snapshot for when
the keyspace disappeared.

@Romain: I believe you are correct about reliability. We just had a repair
--full fail, and the CPU on one of the nodes locked up at 100%. This occurred
on a fairly new keyspace that only has writes. We are also now receiving a
very high percentage of read timeouts. ... might be time to rebuild the cluster.



On Fri, Mar 3, 2017 at 2:34 PM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

>
> On Fri, Mar 3, 2017 at 7:56 AM, Romain Hardouin <romainh_ml@yahoo.fr>
> wrote:
>
>> I suspect a lack of 3.x reliability. Cassandra could have given up with
>> dropped messages but not with a "drop keyspace". I mean, I have already seen
>> some Spark jobs with too many executors produce a high load average on a
>> DC. I saw a C* node with a 1 min. load avg of 140 that could still have a P99
>> read latency of 40ms. But I never saw a disappearing keyspace. There are
>> old tickets regarding C* 1.x but as far as I remember it was due to a
>> create/drop/create keyspace.
>>
>>
>> On Friday, 3 March 2017 at 13:44, George Webster <webstergd@gmail.com>
>> wrote:
>>
>>
>> Thank you for your reply and good to know about the debug statement. I
>> haven't
>>
>> We never dropped or re-created the keyspace before. We haven't even
>> performed writes to that keyspace in months. I also checked the permissions
>> of the Apache user; that user had read-only access.
>>
>> Unfortunately, I reverted from a backup recently. I cannot say for sure
>> anymore if I saw something in the system tables before the revert.
>>
>> Anyway, hopefully it was just a fluke. We have some crazy ML libraries
>> running on it, so maybe Cassandra just gave up? Oh well, Cassandra is a
>> champ and we haven't really had issues with it before.
>>
>> On Thu, Mar 2, 2017 at 6:51 PM, Romain Hardouin <romainh_ml@yahoo.fr>
>> wrote:
>>
>> Did you inspect the system tables to see if there are any traces of your
>> keyspace? Did you ever drop and re-create this keyspace before that?
>>
>> Those lines appear in the debug log because the fd interval is > 2 seconds
>> (the logged values are in nanoseconds). You can override the intervals via
>> the -Dcassandra.fd_initial_value_ms and -Dcassandra.fd_max_interval_ms
>> properties. Are you sure you didn't have these lines in the debug logs
>> before? I used to see them a lot before increasing the intervals to
>> 4 seconds.
>>
>> Best,
>>
>> Romain
>>
>> On Tuesday, 28 February 2017 at 18:25, George Webster <webstergd@gmail.com>
>> wrote:
>>
>>
>> Hey Cassandra Users,
>>
>> We recently encountered an issue where a keyspace just disappeared. I was
>> curious if anyone has had this occur before and can provide some insight.
>>
>> We are using Cassandra 3.10 with 2 DCs, 3 nodes each.
>> The data is still located in the storage folder, but the keyspace is no
>> longer present inside Cassandra.
>>
>> I searched the logs for any hints of errors or commands being executed
>> that could have caused the loss of a keyspace. Unfortunately I found nothing.
>> The only unusual thing I saw in the logs was a series of read timeouts that
>> occurred right around when the keyspace went away. Since then I see
>> numerous entries in the debug log like the following:
>>
>> DEBUG [GossipStage:1] 2017-02-28 18:14:12,580 FailureDetector.java:457 -
>> Ignoring interval time of 2155674599 for /x.x.x..12
>> DEBUG [GossipStage:1] 2017-02-28 18:14:16,580 FailureDetector.java:457 -
>> Ignoring interval time of 2945213745 for /x.x.x.81
>> DEBUG [GossipStage:1] 2017-02-28 18:14:19,590 FailureDetector.java:457 -
>> Ignoring interval time of 2006530862 for /x.x.x..69
>> DEBUG [GossipStage:1] 2017-02-28 18:14:27,434 FailureDetector.java:457 -
>> Ignoring interval time of 3441841231 for /x.x.x.82
>> DEBUG [GossipStage:1] 2017-02-28 18:14:29,588 FailureDetector.java:457 -
>> Ignoring interval time of 2153964846 for /x.x.x.82
>> DEBUG [GossipStage:1] 2017-02-28 18:14:33,582 FailureDetector.java:457 -
>> Ignoring interval time of 2588593281 for /x.x.x.82
>> DEBUG [GossipStage:1] 2017-02-28 18:14:37,588 FailureDetector.java:457 -
>> Ignoring interval time of 2005305693 for /x.x.x.69
>> DEBUG [GossipStage:1] 2017-02-28 18:14:38,592 FailureDetector.java:457 -
>> Ignoring interval time of 2009244850 for /x.x.x.82
>> DEBUG [GossipStage:1] 2017-02-28 18:14:43,584 FailureDetector.java:457 -
>> Ignoring interval time of 2149192677 for /x.x.x.69
>> DEBUG [GossipStage:1] 2017-02-28 18:14:45,605 FailureDetector.java:457 -
>> Ignoring interval time of 2021180918 for /x.x.x.85
>> DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 -
>> Ignoring interval time of 2436026101 for /x.x.x.81
>> DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 -
>> Ignoring interval time of 2436187894 for /x.x.x.82
>>
>> At the time the keyspace disappeared we had two concurrent
>> activities:
>> 1) Running a Spark job (via HDP 2.5.3 in YARN) that was performing a
>> countByKey. It was using the keyspace that disappeared. The operation
>> crashed.
>> 2) We created a new keyspace to test out a schema. The only "fancy" things
>> in that keyspace are a few materialized view tables. Data was being loaded
>> into that keyspace during the crash. The load process was extracting
>> information and then just writing to Cassandra.
>>
>> Any ideas? Anyone seen this before?
>>
>> Thanks,
>> George
>>
>>
>>
>>
>>
>>
> Cassandra takes snapshots for certain events. Does this extend to drop
> keyspace commands? Maybe it should.
>
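
For anyone reading the FailureDetector debug lines quoted above: the logged
interval is in nanoseconds, so converting it to seconds makes it easier to see
how far each peer is over the 2 second figure Romain mentions. Below is a
minimal sketch, not anything from Cassandra itself. The function names, the
regular expression, and the ASSUMED_MAX_INTERVAL_S constant are hypothetical
and only follow the log format and the 2 second value given in this thread.

```python
import re
from collections import defaultdict

# Matches the message part of the debug lines quoted in this thread, e.g.
# "Ignoring interval time of 2945213745 for /x.x.x.81".
INTERVAL_RE = re.compile(r"Ignoring interval time of (\d+) for /(\S+)")

NANOS_PER_SECOND = 1_000_000_000

# Assumption: 2 seconds, as stated in the thread; adjust if the
# fd interval properties were overridden on your nodes.
ASSUMED_MAX_INTERVAL_S = 2.0


def parse_intervals(log_text):
    """Return a list of (peer, interval_in_seconds) found in log_text."""
    return [
        (m.group(2), int(m.group(1)) / NANOS_PER_SECOND)
        for m in INTERVAL_RE.finditer(log_text)
    ]


def worst_interval_per_peer(log_text):
    """Return {peer: largest ignored interval in seconds}."""
    worst = defaultdict(float)
    for peer, seconds in parse_intervals(log_text):
        worst[peer] = max(worst[peer], seconds)
    return dict(worst)


if __name__ == "__main__":
    sample = """
    DEBUG [GossipStage:1] 2017-02-28 18:14:16,580 FailureDetector.java:457 -
    Ignoring interval time of 2945213745 for /x.x.x.81
    DEBUG [GossipStage:1] 2017-02-28 18:14:27,434 FailureDetector.java:457 -
    Ignoring interval time of 3441841231 for /x.x.x.82
    DEBUG [GossipStage:1] 2017-02-28 18:14:29,588 FailureDetector.java:457 -
    Ignoring interval time of 2153964846 for /x.x.x.82
    """
    for peer, seconds in sorted(worst_interval_per_peer(sample).items()):
        over = seconds - ASSUMED_MAX_INTERVAL_S
        print(f"{peer}: worst interval {seconds:.3f}s ({over:.3f}s over)")
```

Feeding it a whole debug.log (read as text) should work the same way, since
the pattern only looks at the message part of each entry.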
