Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 23 Aug 2016 22:34:20 +0000 (UTC)
From: "Wei Deng (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12999380.1471989814000.389531.1471991660567@Atlassian.JIRA>
In-Reply-To: <JIRA.12999380.1471989814000@Atlassian.JIRA>
References: <JIRA.12999380.1471989814000@Atlassian.JIRA> <JIRA.12999380.1471989814916@arcas>
Subject: [jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level
 is handled inefficiently
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 23 Aug 2016 22:34:22 -0000


     [ https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Deng updated CASSANDRA-12526:
---------------------------------
    Description:
I'm using the latest trunk (as of August 2016, which will probably become 3.10) to run some experiments on LeveledCompactionStrategy (LCS), and I noticed this inefficiency.

The test data is generated using cassandra-stress with default parameters (keyspace1.standard1), so, as you can imagine, it consists of a large number of newly inserted partitions that will never merge during compaction, which is probably the worst kind of workload for LCS. (I'll explain later why this scenario should not be dismissed as a corner case; for now, let's assume we still want to handle it efficiently.)

After the compaction test finished, I searched debug.log for lines matching the "Compacted" summary so that I could see how long each compaction took and how many bytes it processed. The search pattern looks like this:

{noformat}
grep 'Compacted.*standard1' debug.log
{noformat}

Interestingly, many of the finished compactions are reported as involving *only one* SSTable. With the workload described above, these "single SSTable" compactions make up the majority of all compactions (218 of 243, as shown below), so their efficiency can affect overall compaction throughput quite a bit.

{noformat}
automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | wc -l
243
automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | grep ") 1 sstable" | wc -l
218
{noformat}
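
For anyone who wants the same tally without chaining greps, here is a minimal Java sketch of the count. It relies only on the two patterns used above ('Compacted.*standard1' and the fixed string ") 1 sstable"); the class and method names are made up for illustration and are not part of Cassandra. With the numbers above, the share works out to 218/243, or roughly 90%.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative helper only; the class and method names are hypothetical.
public class SingleSSTableCompactionCount {
    // Same pattern as: grep 'Compacted.*standard1'
    private static final Pattern COMPACTED = Pattern.compile("Compacted.*standard1");
    // Same fixed string as: grep ") 1 sstable"
    private static final String SINGLE_MARKER = ") 1 sstable";

    /** Returns {matching "Compacted" lines, those that also contain the single-SSTable marker}. */
    public static long[] count(List<String> lines) {
        long total = 0;
        long single = 0;
        for (String line : lines) {
            if (!COMPACTED.matcher(line).find()) {
                continue; // not a "Compacted" summary line for standard1
            }
            total++;
            if (line.contains(SINGLE_MARKER)) {
                single++;
            }
        }
        return new long[] { total, single };
    }

    /** Share of single-SSTable compactions in percent; 0 when nothing matched. */
    public static double singleSharePercent(long total, long single) {
        return total == 0 ? 0.0 : 100.0 * single / total;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: SingleSSTableCompactionCount <debug.log>");
            return;
        }
        long[] c = count(Files.readAllLines(Paths.get(args[0])));
        System.out.printf("%d of %d compactions (%.1f%%) involved a single SSTable%n",
                c[1], c[0], singleSharePercent(c[0], c[1]));
    }
}
```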

Looking at the code, it appears there is a way to directly edit the level of a particular SSTable, like the following:

{code}
sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
sstable.reloadSSTableMetadata();
{code}

Compared to what we have now (reading the whole single SSTable from the old level and writing the same SSTable out at the new level), the only difference I can think of with this approach is that the SSTable keeps its file name (sequence number) after the level change, which could break assumptions in other parts of the code. However, it avoids the full read/write I/O and the overhead of cleaning up the old file, creating the new file, and churning the heap and file buffers, so the benefits seem to outweigh the inconvenience. I'd therefore argue that this JIRA is low-hanging fruit (LHF) and should be made available in 3.0.x as well.
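
To make the proposal concrete, here is a rough sketch of the decision it implies: take a metadata-only shortcut when exactly one SSTable is being up-leveled, and fall back to the usual rewrite otherwise. Everything here is hypothetical: LeveledSSTable and Rewriter are stand-ins invented for illustration, not Cassandra's classes, and the sketch assumes the single SSTable has nothing to merge or purge (true for the all-new-partition workload described above). It is not the actual compaction code path.

```java
import java.util.List;

// Illustrative sketch only; none of these types exist in Cassandra.
public class UpLevelSketch {
    /** Hypothetical stand-in for an SSTable that knows its LCS level. */
    public interface LeveledSSTable {
        int level();
        /** Assumed to rewrite only the level field in the SSTable's metadata. */
        void mutateLevel(int targetLevel);
    }

    /** Hypothetical fallback that does the usual read-and-rewrite compaction. */
    public interface Rewriter {
        void rewrite(List<LeveledSSTable> inputs, int targetLevel);
    }

    /**
     * Up-levels the inputs to targetLevel.
     * Returns true if a metadata-only level change was enough,
     * false if the inputs were handed to the rewriting fallback.
     */
    public static boolean upLevel(List<LeveledSSTable> inputs, int targetLevel, Rewriter fallback) {
        if (inputs.size() == 1 && inputs.get(0).level() < targetLevel) {
            // Single input moving up: no data is read or written,
            // so the data file keeps its name (sequence number).
            inputs.get(0).mutateLevel(targetLevel);
            return true;
        }
        // Multiple inputs (or no level increase): do the full rewrite.
        fallback.rewrite(inputs, targetLevel);
        return false;
    }
}
```

The shortcut keeps the file name unchanged, which is exactly the side effect flagged above as a possible source of broken assumptions elsewhere.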

As mentioned in the second paragraph, I also want to address why this kind of all-new-partition workload should not be dismissed as a corner case. In the main use case for LCS, where you need to merge partitions frequently to optimize reads and to eliminate tombstones and expired data sooner, LCS can happily and efficiently perform partition merges and tombstone elimination for a long time. However, as soon as the node becomes a bit unhealthy for whatever reason (a bad disk that causes it to miss a whole bunch of mutations and need repair, a user who ingests far more data than usual and exceeds the node's capacity, or, God forbid, a DBA who runs an offline sstablelevelreset), you will have to handle this kind of "all-new-partition with a lot of SSTables in L0" scenario. Once all the L0 SSTables finally get up-leveled to L1, you will likely see a lot of these single-SSTable compactions, which is the situation this JIRA is intended to address.


> For LCS, single SSTable up-level is handled inefficiently
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-12526
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Wei Deng
>              Labels: compaction, lcs, performance


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)