cassandra-commits mailing list archives

From "Wei Deng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently
Date Thu, 01 Sep 2016 21:30:20 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Deng updated CASSANDRA-12526:
---------------------------------
    Description: 
I'm using the latest trunk (as of August 2016, which will probably become 3.10) to run
some experiments on LeveledCompactionStrategy and noticed this inefficiency.

The test data is generated using cassandra-stress default parameters (keyspace1.standard1),
so as you can imagine, it consists of a ton of newly inserted partitions that will never merge
in compactions, which is probably the worst kind of workload for LCS (however, I'll detail
later why this scenario should not be ignored as a corner case; for now, let's just assume
we still want to handle this scenario efficiently).

After the compaction test is done, I grepped debug.log for lines that match the "Compacted"
summary so that I could see how long each individual compaction took and how many bytes it
processed. The search pattern is like the following:

{noformat}
grep 'Compacted.*standard1' debug.log
{noformat}

Interestingly, I noticed that a lot of the finished compactions are marked as involving *only one*
SSTable. With the workload mentioned above, these "single SSTable" compactions actually
make up the majority of all compactions (218 of 243 in this run, i.e. roughly 90%, as shown below),
so their efficiency can affect the overall compaction throughput quite a bit.

{noformat}
automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1
| wc -l
243
automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1
| grep ") 1 sstable" | wc -l
218
{noformat}

By looking at the code, it appears that there's a way to directly edit the level of a particular
SSTable like the following:

{code}
sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
sstable.reloadSSTableMetadata();
{code}
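
To make the idea more concrete, here is a minimal sketch of how such a single-SSTable up-level could be short-circuited into a metadata-only operation. The class and method names below (SingleSSTableUpLevel, mutateLevelOnly) are purely illustrative and do not exist in the codebase; the sketch simply wraps the two calls above and glosses over whatever tracker/strategy notification the real compaction path would still need.

{code}
import java.io.IOException;

import org.apache.cassandra.io.sstable.format.SSTableReader;

// Illustrative sketch only: the names here are hypothetical, and a real patch
// would live inside the LCS compaction path and also have to tell the tracker
// and compaction strategy about the level change.
public final class SingleSSTableUpLevel
{
    private SingleSSTableUpLevel() {}

    /**
     * Move the given SSTable to targetLevel by rewriting only its Stats
     * metadata component and reloading the in-memory copy, instead of
     * re-reading and re-writing the whole data file.
     */
    public static void mutateLevelOnly(SSTableReader sstable, int targetLevel) throws IOException
    {
        sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
        sstable.reloadSSTableMetadata();
    }
}
{code}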

To be precise, I summed up the time spent on these single-SSTable compactions (the total data
size is 60GB) and found that if each compaction only needed ~100 ms for the metadata-only change
(instead of the 10+ seconds it takes now), we would already save 22.75% of the total compaction
time. As a rough illustration, 218 such compactions at ~10 s each amount to about 36 minutes of
compaction work that would shrink to roughly 22 seconds.

Compared to what we do now (reading the whole SSTable from the old level and writing out the
same SSTable at the new level), the only difference I can think of with this approach is that the
up-leveled SSTable keeps the same file name (sequence number) as the old one, which could break
assumptions in other parts of the code. However, given that we no longer have to go through the
full read/write I/O, clean up the old file, create a new file, or generate extra churn in the heap
and file buffers, the benefits seem to outweigh the inconvenience. So I'd argue this JIRA belongs
to LHF and should be made available in 3.0.x as well.

As mentioned in the 2nd paragraph, I also want to address why this kind of all-new-partition
workload should not be dismissed as a corner case. For the main LCS use case, where you frequently
merge partitions to optimize reads and eliminate tombstones and expired data sooner, LCS can be
perfectly happy and perform partition merges and tombstone elimination efficiently for a long time.
However, as soon as the node becomes a bit unhealthy for whatever reason (it could be a bad disk,
so the node is missing a whole bunch of mutations and needs repair; it could be that the user
ingests far more data than usual and exceeds the node's capacity; or, God forbid, some DBA runs
offline sstablelevelreset), you will have to handle this "all-new-partition with a lot of SSTables
in L0" scenario, and once all those L0 SSTables finally get up-leveled to L1, you will likely see
a lot of such single-SSTable compactions, which is the situation this JIRA is intended to address.

Actually, thinking about this further, making single-SSTable up-levels more efficient will not
only help the all-new-partition scenario, but will also help in general whenever there is a big
backlog of L0 SSTables due to too many flushes or excessive repair streaming with vnodes. In those
situations, STCS-in-L0 is triggered by default, and you end up with a bunch of much bigger L0
SSTables once STCS is done. When it's time to up-level those bigger L0 SSTables, they will most
likely overlap among themselves, and they will all be added to the compaction session (along with
all overlapping L1 SSTables). Because these bigger L0 SSTables have already gone through a few
rounds of STCS compactions, any partition merging that needed to happen (because fragments of the
same partition were dispersed across the earlier, smaller L0 SSTables) has largely been done
already; the bigger L0 SSTables produced by STCS offer little further opportunity for partition
merging, so we are in a scenario very similar to the L0 data that "consists of a ton of newly
inserted partitions that will never merge in compactions" mentioned earlier.

> For LCS, single SSTable up-level is handled inefficiently
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-12526
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Wei Deng
>              Labels: compaction, lcs, performance
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
