cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12730) Thousands of empty SSTables created during repair - TMOF death
Date Sat, 05 Nov 2016 20:56:58 GMT


Benjamin Roth commented on CASSANDRA-12730:

After thinking a bit about this issue, I came up with the following proposal:

Given that it is necessary to flush memtables before streaming, so that there is immutable data and a
consistent state to stream, we only need to flush a memtable if it contains data for the
token range(s) that are about to be streamed.
To achieve this, StreamSession.prepare should not always pass true as "flushTables" but should
check whether the affected memtables contain partitions for the requested range(s). To be able
to determine that, the Memtable class requires a method that takes a token range, iterates
over its own partition keys, and checks whether the passed token range contains each partition
key. To cap the resources required for that check, it may make sense to skip it if a memtable
contains very many partitions and a flush would be appropriate anyway.
This check is not meant to avoid a flush at all costs but to avoid excessive flushes. So maybe
there could be a limit of e.g. 1000 (TBD) partitions that are checked; if the memtable
contains more partitions, it is flushed without a range check. In other words, it is a more
sophisticated version of Memtable.isClean with a token range as context, like "isCleanForRange(Range<Token>)"
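To illustrate the idea, here is a rough sketch of such a check. The types TokenRange and MemtableSketch and the constant RANGE_CHECK_CAP are simplified stand-ins for illustration only, not actual Cassandra internals, and the 1000-partition cap is the TBD value from above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for Cassandra's Range<Token>, half-open (left, right].
class TokenRange {
    final long left, right;
    TokenRange(long left, long right) { this.left = left; this.right = right; }
    boolean contains(long token) { return token > left && token <= right; }
}

// Simplified stand-in for a Memtable that tracks only partition tokens.
class MemtableSketch {
    static final int RANGE_CHECK_CAP = 1000; // TBD limit from the proposal
    private final List<Long> partitionTokens = new ArrayList<>();

    void put(long token) { partitionTokens.add(token); }

    // "isClean with a token range as context": returns true only if we can
    // cheaply prove no partition falls into the requested range.
    boolean isCleanForRange(TokenRange range) {
        if (partitionTokens.isEmpty())
            return true;  // plain isClean case, nothing to stream
        if (partitionTokens.size() > RANGE_CHECK_CAP)
            return false; // too many partitions to scan: flush anyway
        for (long token : partitionTokens)
            if (range.contains(token))
                return false; // range is dirty, a flush is needed
        return true; // no partition in the range: the flush can be skipped
    }
}
```

StreamSession.prepare could then flush only when isCleanForRange returns false for a requested range, instead of unconditionally passing true as "flushTables".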

The impact on existing code should be quite small, and the risk of a performance impact on
production systems should be quite small as well. In fact, the risk of piling up thousands
of SSTables is much higher than the calculable overhead of some simple checks.

How does that sound?

> Thousands of empty SSTables created during repair - TMOF death
> --------------------------------------------------------------
>                 Key: CASSANDRA-12730
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
> Last night I ran a repair on a keyspace with 7 tables and 4 MVs, each containing a few
hundred million records. After a few hours a node died because of "too many open files".
> Normally one would just raise the limit, but: We already set this to 100k. The problem
was that the repair created roughly over 100k SSTables for a certain MV. The strange thing
is that these SSTables had almost no data (like 53 bytes, 90 bytes, ...). Some of them (<5%)
had a few hundred KB; very few (<1%) had normal sizes like >= a few MB. I could understand
that SSTables queue up as they are flushed and not compacted in time, but then they should
have at least a few MB (depending on config and available memory), right?
> Of course then the node runs out of FDs and I guess it is not a good idea to raise the
limit even higher as I expect that this would just create even more empty SSTables before
dying at last.
> Only 1 CF (MV) was affected. All other CFs (also MVs) behave sanely. Empty SSTables have
been created evenly over time, 100-150 every minute. Among the empty SSTables there are also
tables that look normal, having a few MBs.
> I didn't see any errors or exceptions in the logs until TMOF occurred. Just tons of streams
due to the repair (which I actually run via cs-reaper as subrange, full repairs).
> After having restarted that node (and no more repair running), the number of SSTables
went down again as they are compacted away slowly.
> According to [~zznate] this issue may relate to CASSANDRA-10342 + CASSANDRA-8641

This message was sent by Atlassian JIRA
