cassandra-commits mailing list archives

From Björn Hegerfors (JIRA) <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-10510) Compacted SSTables failing to get removed, overflowing disk
Date Tue, 13 Oct 2015 16:19:06 GMT
Björn Hegerfors created CASSANDRA-10510:
-------------------------------------------

             Summary: Compacted SSTables failing to get removed, overflowing disk
                 Key: CASSANDRA-10510
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10510
             Project: Cassandra
          Issue Type: Bug
            Reporter: Björn Hegerfors
         Attachments: nonReleasedSSTables.txt

Short version: it appears that if the resulting SSTable of a compaction enters another compaction
soon after, the SSTables participating in the former compaction don't get deleted from disk
until Cassandra is restarted.

We have run into a big problem after applying CASSANDRA-10276 and CASSANDRA-10280, backported
to 2.0.14. The bug we're seeing was not introduced by these patches; they have just made it
very apparent and harmful.

Here's what has happened. We had repair running on our table, which holds a time series and uses
DTCS. The ring was split into 5016 small ranges that were repaired one after the other (using
parallel repair, i.e. not snapshot repair). This causes a flood of tiny SSTables to get streamed
into all nodes (we don't use vnodes), with timestamp ranges similar to those of the existing
SSTables on disk. The problem with that is the sheer number of SSTables; disk usage is not
affected. This has been reported before; see CASSANDRA-9644. These SSTables are streamed
continuously for up to a couple of days.
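
Purely to illustrate the kind of slicing involved (this is my own sketch, not anything Cassandra
ships or prescribes): splitting the full token space into N equal subranges, each of which is
then repaired on its own, looks roughly like the following. It assumes Murmur3Partitioner's
token bounds, so adjust for other partitioners.

{code:java}
import java.math.BigInteger;

// Illustrative only: split the Murmur3 token space into N equal subranges,
// the sort of slicing used to drive 5016 small, sequential repairs.
public class TokenRangeSplitter {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    public static void main(String[] args) {
        int n = 5016;
        BigInteger span = MAX.subtract(MIN);
        BigInteger start = MIN;
        for (int i = 1; i <= n; i++) {
            BigInteger end = (i == n) ? MAX
                    : MIN.add(span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            // Each (start, end] pair becomes one small subrange repair,
            // e.g. nodetool repair -st <start> -et <end> <keyspace> <table>
            System.out.println(start + " " + end);
            start = end;
        }
    }
}
{code}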

The patches were applied to fix the problem of ending up with tens of thousands of SSTables
that would never get touched by DTCS. But now that DTCS does touch them, we have run into
a new problem instead. While disk usage was around 25-30% before the repairs began, it started
growing fast when these continuous streams started coming in. Eventually, a couple of nodes
ran out of disk, which led us to stop all repairs on the cluster.

This didn't reduce the disk usage. Compactions were of course very active. But more than
doubling disk usage should not be possible, regardless of the choices your compaction strategy
makes: even a worst-case compaction only needs room for its inputs and its output at the same
time while it runs. And we were not getting anywhere near that much data streamed in. Large
numbers of SSTables, yes, but not much volume; the nodes were effectively creating data out
of thin air.

We have a tool to show timestamp and size metadata of SSTables. What we found, looking at
all non-tmp data files, was something akin to duplicates of almost all the largest SSTables.
Not quite exact replicas, but there were these multi-gigabyte SSTables covering exactly the
same range of timestamps and differing in size by mere kilobytes. There were typically 3 of
each of the largest SSTables, sometimes even more.
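
To make concrete what counted as a "duplicate" here: SSTables sharing the exact same
(min timestamp, max timestamp) range while differing only marginally in size. Below is a rough,
self-contained sketch of that grouping check; the record shape and the hard-coded sample values
are made up for illustration and do not reflect the attached file's format or our tool's output.

{code:java}
import java.util.*;
import java.util.stream.Collectors;

// Rough illustration of the "duplicate" check: SSTables that cover exactly the
// same (minTimestamp, maxTimestamp) range are suspicious, since only the newest
// member of such a compaction lineage should still be on disk.
public class DuplicateSSTableFinder {
    record SSTable(int generation, long minTimestamp, long maxTimestamp, long sizeBytes) {}

    public static void main(String[] args) {
        // Made-up sample values standing in for real SSTable metadata.
        List<SSTable> sstables = List.of(
                new SSTable(1, 1_400_000_000_000L, 1_444_000_000_000L, 9_000_123_456L),
                new SSTable(2, 1_400_000_000_000L, 1_444_000_000_000L, 9_000_567_890L),
                new SSTable(3, 1_400_000_000_000L, 1_444_000_000_000L, 9_001_012_345L),
                new SSTable(4, 1_444_000_000_001L, 1_444_100_000_000L, 5_000_000L));

        Map<List<Long>, List<SSTable>> byRange = sstables.stream()
                .collect(Collectors.groupingBy(s -> List.of(s.minTimestamp(), s.maxTimestamp())));

        byRange.values().stream()
                .filter(group -> group.size() > 1)
                .forEach(group -> System.out.println("Suspected duplicates: " + group.stream()
                        .map(s -> s.generation() + " (" + s.sizeBytes() + " bytes)")
                        .collect(Collectors.joining(", "))));
    }
}
{code}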

Here's what I suspect: DTCS is the only compaction strategy that would commonly finish compacting
a really large SSTable and then, on the very next run of the compaction strategy, nominate the
result for yet another compaction, even together with tiny SSTables, which certainly happens
in our scenario. Potentially, the large SSTable that participated in the first compaction might
even get nominated again by DTCS, if for some reason it can be returned by getUncompactingSSTables.
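
To spell out the failure mode I have in mind, here is a deliberately simplified model of my own,
not Cassandra's actual code, with every class and method name made up: if deletion of a compacted
SSTable is deferred until its reference count reaches zero, then anything that takes an extra
reference around the time of the second compaction, and never releases it, keeps the obsolete
files on disk until a restart throws the in-memory state away.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only -- not Cassandra's real SSTableReader.
// Shows how a missed release keeps an obsolete file on disk.
class RefCountedSSTable {
    final String path;
    private final AtomicInteger refs = new AtomicInteger(1); // the "self" reference
    private volatile boolean obsolete = false;

    RefCountedSSTable(String path) { this.path = path; }

    boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;               // already fully released
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    void release() {
        if (refs.decrementAndGet() == 0 && obsolete)
            System.out.println("deleting " + path); // real code would delete the files here
    }

    // Called when the compaction that consumed this SSTable finishes.
    void markObsolete() {
        obsolete = true;
        release();                                  // drop the "self" reference
    }
}

public class DeferredDeletionDemo {
    public static void main(String[] args) {
        RefCountedSSTable big = new RefCountedSSTable("373702-Data.db");

        big.acquire();       // the first compaction takes a reference
        big.acquire();       // an overlapping nomination grabs another one
        big.markObsolete();  // first compaction finished: the SSTable is now obsolete
        big.release();       // first compaction releases -> count is still > 0

        // If the second reference is never released (the suspected bug), the
        // "deleting ..." line never prints and the file survives until restart.
    }
}
{code}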

Whatever the reason, I have collected evidence showing that these large "duplicate" SSTables
are of the same "lineage". Only one should remain on disk: the latest one. The older ones
have already been compacted, producing the newer ones, but for some reason they never got
deleted from disk. And this was really harmful when combining DTCS with continuously streaming
in tiny SSTables. The same thing, only worse, would happen without the patches and with an
uncapped max_sstable_age_days.

Attached is one occurrence of 3 such duplicated SSTables, along with their metadata and the
log lines about their compactions. You can see how similar they were to each other. SSTable
generations 374277,
374249, 373702 (the large one), 374305, 374231 and 374333 completed compaction at 04:05:26,878,
yet they were all still on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered
another compaction with 375174. They also stayed around after that compaction finished. Literally
all SSTables named in these log lines were still on disk when I checked! Only one should have
remained: 375189.

Now, this was just one random example from the data I collected; this happened everywhere.
Some SSTables should probably have been deleted a day earlier.

However, once we restarted the nodes, all of the duplicates were suddenly gone!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
