cassandra-dev mailing list archives

From Tyler Hobbs <>
Subject Re: March 2015 QA retrospective
Date Fri, 10 Apr 2015 15:55:51 GMT
> Bloom Filter
> truePositive counter not updated on key cache hit
> * We could avoid this with extensive checking of metrics output after
> tests, having JMX available in dtests would be a good start

I recently added some utilities for using JMX from the dtests:

On Fri, Apr 10, 2015 at 7:04 AM, Benedict Elliott Smith <> wrote:

> TL;DR: "Kitchen sink" (aggressive randomised stress with subsystem
> correctness) tests; commitlog/memtable isolated correctness stress testing;
> improved tool/utility testing; internal structural changes to prevent
> occurrence (delivered); fault injection testing. Filed #916[1-5]
> <> Benedict
> FileNotFoundException during STREAM-OUT triggers 100% CPU usage Streaming
> This particular class of bug should be near impossible, due to structural
> changes beginning with 7705. For testing such an uncommon race condition,
> we would hope it to be exhibited eventually by our kitchen sink aggressive
> testing, but it would be a very uncommon event.
> CASSANDRA-8383 <>
> Benedict
> Memtable flush may expire records from the commit log that are in a later memtable
> No regression test, no follow up ticket. Could/should this have been
> reproducible as an actual bug?
> As stated on the ticket, we need to introduce rigorous randomized testing
> of the commit log's correctness, both in isolation and in conjunction with
> memtable flushing. This is not a trivial undertaking. Whether or not it
> integrates with our kitchen sink tests is an open question, but I think
> that might be difficult. I've filed #9162 to track this.
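
A minimal sketch of what randomized commit log and memtable-flush correctness testing could look like, on a toy in-memory model rather than Cassandra's real classes. Everything here (`ToyStore`, `run_trial`, the `buggy` flag) is hypothetical and exists only for illustration. The buggy mode discards log records up to the current log position when a flush finishes, instead of the position recorded at the memtable switch, which loosely mirrors the class of bug described above.

```python
import random


class ToyStore:
    """Toy model: an append-only commit log plus memtables flushed to 'disk'."""

    def __init__(self, buggy=False):
        self.buggy = buggy
        self.log = []          # (position, key, value) records not yet discarded
        self.next_pos = 0
        self.memtable = {}     # current, unflushed memtable
        self.flushing = None   # (old memtable, log position at switch) or None
        self.flushed = {}      # durable data, stands in for sstables

    def write(self, key, value):
        self.log.append((self.next_pos, key, value))
        self.next_pos += 1
        self.memtable[key] = value

    def begin_flush(self):
        # Switch memtables and remember where the old one ends in the log.
        self.flushing = (self.memtable, self.next_pos)
        self.memtable = {}

    def finish_flush(self):
        old, boundary = self.flushing
        self.flushing = None
        self.flushed.update(old)
        if self.buggy:
            # Wrong: also discards records that belong to the newer memtable.
            boundary = self.next_pos
        self.log = [r for r in self.log if r[0] >= boundary]

    def recover(self):
        # Simulated crash recovery: durable data plus commit log replay.
        state = dict(self.flushed)
        for _, key, value in self.log:
            state[key] = value
        return state


def run_trial(seed, steps=200, buggy=False):
    """Random writes and flushes, then 'crash' and compare against a model."""
    rng = random.Random(seed)
    store, model = ToyStore(buggy), {}
    for i in range(steps):
        if rng.random() < 0.7:
            key = rng.randrange(20)
            store.write(key, i)
            model[key] = i
        elif store.flushing is None:
            store.begin_flush()
        else:
            store.finish_flush()
    return store.recover() == model
```

Running `run_trial` over many seeds should always return `True` for the correct model, while the buggy mode is expected to fail for at least some seeds. A real test would need to drive the actual commit log and flush paths, which is the non-trivial part.
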
> CASSANDRA-8429 <>
> Benedict
> Some keys unreadable during compaction
> Running stress in CI would have caught this, and we're going to do that.
> CASSANDRA-8459 <>
> Benedict
> "autocompaction" on reads can prevent memtable space reclaimation
> Kitchen sink tests with sufficiently large partitions written over a
> sufficiently large period of time. Same risk present for e.g. secondary
> indexes, so aggressive coverage of these, including scans etc., is important.
> CASSANDRA-8499 <>
> Benedict
> Ensure SSTableWriter cleans up properly after failure
> Testing error paths? Any way to test things in a loop to detect leaks?
> Leaks of this kind are now reported, and autocorrected for, so detecting them is
> much easier. However, fault injection testing (if we can find a good way for
> license compliance) as I started in CASSANDRA-8568 would also help a lot.
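
A minimal sketch of the "test error paths in a loop to detect leaks" idea, assuming a toy resource counter rather than Cassandra's own reference tracking. `ToyWriter`, `write_table` and `leaked_after_failures` are hypothetical names used only for illustration. The loop injects a failure at a different row each iteration and checks that the number of open handles returns to its baseline.

```python
class ToyWriter:
    """Stand-in for a writer that holds a resource until closed."""

    open_handles = 0  # class-wide count of unreleased resources

    def __init__(self):
        ToyWriter.open_handles += 1

    def close(self):
        ToyWriter.open_handles -= 1


def write_table(rows, fail_at=None, leaky=False):
    """Write rows, optionally raising an injected failure at index fail_at."""
    writer = ToyWriter()
    try:
        for i, _row in enumerate(rows):
            if i == fail_at:
                raise IOError("injected failure")
    except IOError:
        if not leaky:
            writer.close()  # the error path must release the resource too
        raise
    writer.close()


def leaked_after_failures(iterations=100, leaky=False):
    """Run the failing operation in a loop and report handles left open."""
    baseline = ToyWriter.open_handles
    for n in range(iterations):
        try:
            write_table(range(10), fail_at=n % 10, leaky=leaky)
        except IOError:
            pass
    return ToyWriter.open_handles - baseline
```

With the cleanup in place the function returns 0; with `leaky=True` every injected failure leaves one handle open, so the leak grows with the iteration count and is easy to assert on.
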
> CASSANDRA-8513 <>
> Benedict
> SSTableScanner may not acquire reference, but will still release it when
> closed
> This had a user visible component, what test could have caught it before
> release?
> Again, this cannot happen now, due to internal structural changes to
> prevent it.
> CASSANDRA-8619 <>
> Benedict
> using CQLSSTableWriter gives ConcurrentModificationException
> Some better testing of our tools and utilities. The fix for this introduced
> its own bug, by the looks of it, which we also did not catch. Better
> (randomized long testing) coverage of these tools would help in both fixing
> and ensuring it doesn't return again.
> CASSANDRA-8632 <>
> Benedict
> cassandra-stress only generating a single unique row
> This was caught prior to release by developer use, which is currently the
> only QA we have for stress. Some basic testing would certainly be helpful,
> but there is a tension between getting stress to do useful things, and
> testing that it does so, since there are finite resources available to us.
> The utility is currently probably more pressing, given the eyes it gets
> when it is used. With more complex validation arriving, in conjunction with
> performance profile histories and its generally being employed as a dev
> tool, it should somewhat self test (major changes in performance profiles
> should be explicable else investigated, and critical mistakes should often
> lead to failed validation, or to users noticing a problem), and I expect
> this will have to suffice for the interim.
> CASSANDRA-8668 <>
> Benedict
> We don't enforce offheap memory constraints; regression introduced by 7882
> This would have been easily found with a kitchen sink test that was
> inserting large columns. We should probably also have some specific tests
> for ensuring the allocation tracking is exactly correct (by inspecting the
> whole object graph independently, and reconciling the values), but this is
> fiddly and of low immediate yield.
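
A minimal sketch of reconciling a running allocation counter against an independent walk of the object graph, on a toy structure. `TrackedPool`, `walk_size`, `reconcile` and `run_workload` are hypothetical names for illustration, and the `leaky_drop` flag simulates a tracking bug where a release is never subtracted from the counter.

```python
import random


class TrackedPool:
    """Toy container that keeps a running count of bytes it holds."""

    def __init__(self, leaky_drop=False):
        self.leaky_drop = leaky_drop
        self.allocated = 0      # running counter maintained on each operation
        self.partitions = {}    # key -> list of byte strings

    def put(self, key, cell):
        self.partitions.setdefault(key, []).append(cell)
        self.allocated += len(cell)

    def drop(self, key):
        cells = self.partitions.pop(key, [])
        if not self.leaky_drop:
            self.allocated -= sum(len(c) for c in cells)


def walk_size(pool):
    """Independently recompute the size by walking everything reachable."""
    return sum(len(c) for cells in pool.partitions.values() for c in cells)


def reconcile(pool):
    """Return the drift between the tracked counter and the walked size."""
    return pool.allocated - walk_size(pool)


def run_workload(seed, ops=500, leaky_drop=False):
    """Apply random puts and drops, then report the tracking drift."""
    rng = random.Random(seed)
    pool = TrackedPool(leaky_drop)
    for _ in range(ops):
        if rng.random() < 0.8:
            pool.put(rng.randrange(5), b"x" * rng.randrange(1, 50))
        else:
            pool.drop(rng.randrange(5))
    return reconcile(pool)
```

A drift of zero after a random workload means the counter and the walk agree; any non-zero value points at an operation that updates one but not the other.
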
> CASSANDRA-8719 <>
> Benedict
> Using thrift HSHA with offheap_objects appears to corrupt data
> *Untested configuration before release, this would be straightforward if we
> ran with it?*
> Spot on.
> CASSANDRA-8726 <>
> Benedict
> throw OOM in Memory if we fail to allocate OOM
> Kind of tricky to induce an OOM; in general we consider an OOM to put C*
> into an unstable state as well, so correct behaviour is just to shut down,
> making it potentially tricky to test all avenues that could throw OOM.
> Possibly the best route is to modify the byte code to corrupt the return
> value to zero for each possible avenue we can reach it by, and confirm that
> shutdown occurs safely.
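
A minimal sketch of the "corrupt the return value to zero and confirm shutdown" idea. It swaps byte-code modification for Python's `unittest.mock.patch.object`, and runs against a toy `Allocator` and `Node` that are hypothetical stand-ins, not Cassandra classes. The point is only the shape of the test: force the allocation call to return 0, then check that the failure is raised and the shutdown path runs.

```python
from unittest import mock


class Allocator:
    """Hypothetical native allocator: returns an address, or 0 on failure."""

    def allocate(self, size):
        return 0x1000  # pretend the allocation always succeeds


class Node:
    """Toy process that must shut down if an allocation fails."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True

    def allocate(self, size):
        address = self.allocator.allocate(size)
        if address == 0:
            # Treat a failed allocation as fatal: shut down, then raise.
            self.shutdown()
            raise MemoryError("failed to allocate %d bytes" % size)
        return address


def shuts_down_on_failed_allocation():
    """Inject a zero return from the allocator and report whether we shut down."""
    node = Node(Allocator())
    with mock.patch.object(Allocator, "allocate", return_value=0):
        try:
            node.allocate(64)
        except MemoryError:
            return node.shut_down
    return False
```

In a real system the same check would have to be repeated for each call site that can reach the allocator, which is what makes covering all avenues tricky.
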

Tyler Hobbs
DataStax <>
