cassandra-commits mailing list archives

From "Nikolai Grigoriev (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7949) LCS compaction low performance, many pending compactions, nodes are almost idle
Date Fri, 17 Oct 2014 03:58:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174702#comment-14174702 ]

Nikolai Grigoriev edited comment on CASSANDRA-7949 at 10/17/14 3:57 AM:
------------------------------------------------------------------------

Update:

Using the property from CASSANDRA-6621 does help to get out of this state. My cluster is slowly
digesting the large sstables and creating a bunch of nice small sstables from them. It is slower
than using sstablesplit, I believe, because it does real compactions and thus processes and
reprocesses different sets of sstables. My understanding is that every time a new bunch of L0
sstables appears there is a phase of updating the other levels, and that work repeats over and over.
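
For reference, the switch in question is a JVM system property (cassandra.disable_stcs_in_l0,
if I recall the name introduced by CASSANDRA-6621 correctly). It can be added to cassandra-env.sh
on each node before restarting it, along the lines of:

    JVM_OPTS="$JVM_OPTS -Dcassandra.disable_stcs_in_l0=true"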

With that property set I see that the total number of sstables grows, the number of "huge"
sstables decreases and the average sstable size decreases as a result.

My conclusions so far:

1. STCS fallback in LCS is a double-edged sword. It is needed to prevent flooding the node with
tons of small sstables resulting from ongoing writes. These small ones are often much smaller than
the configured target size and they need to be merged. But the use of STCS also results in the
generation of super-sized sstables, which become a large headache when the fallback stops and LCS
is supposed to resume normal operations. It appears to me (my humble opinion) that the fallback
should switch to some kind of specialized "rescue" STCS flavor that merges the small sstables up
to approximately the LCS target sstable size BUT DOES NOT create sstables much larger than that
target. With this approach LCS would resume normal operations much faster once the cause of the
fallback (abnormally high write load) is gone. A rough sketch of that bucketing idea follows this
list.

2. LCS has a major (performance?) issue when there are super-large sstables in the system. It
often gets stuck in a single long (many hours) compaction stream that, by itself, increases the
probability of another STCS fallback even under a reasonable write load. As a possible workaround
it was recommended that I consider running multiple C* instances on our relatively powerful
machines, to significantly reduce the amount of data per node and increase compaction throughput.

3. On existing systems, depending on how much "work" the STCS fallback has already done, the fix
from CASSANDRA-6621 may help to recover while keeping the nodes up. Recovery will take a very long
time, but the nodes will stay online.

4. Recovery (see above) is very long, much longer than the "stress period" that caused the
condition. In my case I was writing like crazy for about 4 days and it has been over a week of
compactions after that; I am still very far from 0 pending compactions. Considering this, it makes
sense to artificially throttle the write speed when generating the data (as in the use case I
described in previous comments); a throttling sketch also follows below. The extra time spent
writing the data will still be significantly shorter than the time required to recover from the
consequences of abusing the available write bandwidth.
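
Below is a rough, illustrative sketch (plain Java, not actual Cassandra code) of the "rescue"
bucketing idea from point 1: greedily pack the small sstables (represented here just by their
sizes in bytes) into merge groups capped near the LCS target size, so the fallback never produces
the multi-hundred-GB monsters that regular STCS does. Class and method names are made up.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RescueBucketing {
        // Greedily packs small sstable sizes into merge buckets whose combined size
        // stays at or below targetBytes, so no merge overshoots the LCS target much.
        static List<List<Long>> bucketBySize(List<Long> sstableSizes, long targetBytes) {
            List<Long> small = new ArrayList<Long>();
            for (long size : sstableSizes) {
                if (size < targetBytes) {
                    small.add(size);              // only the undersized tables get "rescued"
                }
            }
            Collections.sort(small);

            List<List<Long>> buckets = new ArrayList<List<Long>>();
            List<Long> current = new ArrayList<Long>();
            long currentTotal = 0;
            for (long size : small) {
                if (!current.isEmpty() && currentTotal + size > targetBytes) {
                    buckets.add(current);         // close the bucket before it overshoots the target
                    current = new ArrayList<Long>();
                    currentTotal = 0;
                }
                current.add(size);
                currentTotal += size;
            }
            if (current.size() > 1) {
                buckets.add(current);             // a single leftover table needs no merge
            }
            return buckets;
        }
    }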
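
And here is a minimal sketch of the client-side throttling mentioned in point 4, using Guava's
RateLimiter with the DataStax Java driver. The keyspace, column names, write rate and row count
are assumptions for illustration only, not the actual values from my simulator.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.RateLimiter;

    public class ThrottledLoader {
        public static void main(String[] args) {
            // Assumed numbers - tune to whatever the cluster can compact away in real time.
            final double targetWritesPerSecond = 20000.0;
            final long rowsToWrite = 1000000L;

            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("myks");
            // The column names are made up; the real schema is not shown in this ticket.
            PreparedStatement insert =
                    session.prepare("INSERT INTO table_list1 (key, col, val) VALUES (?, ?, ?)");

            RateLimiter limiter = RateLimiter.create(targetWritesPerSecond);
            for (long i = 0; i < rowsToWrite; i++) {
                limiter.acquire();                // block until a write "permit" is available
                session.execute(insert.bind("key-" + i, "col-" + i, "val-" + i));
            }
            cluster.close();
        }
    }

Synchronous execute() per row keeps the example simple; with executeAsync() one would still call
limiter.acquire() before each request to bound the in-flight write rate.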


> LCS compaction low performance, many pending compactions, nodes are almost idle
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7949
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7949
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.1-1, Cassandra 2.0.8
>            Reporter: Nikolai Grigoriev
>         Attachments: iostats.txt, nodetool_compactionstats.txt, nodetool_tpstats.txt,
pending compactions 2day.png, system.log.gz, vmstat.txt
>
>
> I've been evaluating a new cluster of 15 nodes (32 cores, 6x800GB SSD disks + 2x600GB SAS,
128GB RAM, OEL 6.5) and I've built a simulator that creates load similar to the load of our future
product. Before running the simulator I had to pre-generate enough data. This was done using Java
code and the DataStax Java driver. Without going deep into details: two tables have been
generated, each currently with about 55M rows and between a few dozen and a few thousand columns
per row.
> This data generation process was producing a massive amount of non-overlapping data, so the
activity was write-only and highly parallel. This is not the type of traffic the system will
ultimately have to deal with; in the future it will be a mix of reads and updates to existing
data. This is just to explain the choice of LCS, not to mention the expensive SSD disk space.
> At some point while generating the data I noticed that the compactions started to pile up. I
knew that I was overloading the cluster, but I still wanted the generation test to complete,
expecting to give the cluster enough time afterwards to finish the pending compactions and get
ready for real traffic.
> However, after the storm of write requests had stopped, I noticed that the number of pending
compactions remained constant (and even climbed up a little bit) on all nodes. After trying to
tune some parameters (like setting the compaction bandwidth cap to 0) I noticed a strange pattern:
the nodes were compacting one of the CFs in a single stream, using virtually no CPU and no disk
I/O. This process was taking hours. It would then be followed by a short burst of a few dozen
compactions running in parallel (CPU at 2000%, some disk I/O - up to 10-20%) before getting stuck
again for many hours doing one compaction at a time. So it looks like this:
> # nodetool compactionstats
> pending tasks: 3351
>        compaction type    keyspace    table          completed      total            unit    progress
>             Compaction    myks        table_list1    66499295588    1910515889913    bytes    3.48%
> Active compaction remaining time :        n/a
> # df -h
> ...
> /dev/sdb        1.5T  637G  854G  43% /cassandra-data/disk1
> /dev/sdc        1.5T  425G  1.1T  29% /cassandra-data/disk2
> /dev/sdd        1.5T  429G  1.1T  29% /cassandra-data/disk3
> # find . -name *table_list1*Data* | grep -v snapshot | wc -l
> 1310
> Among these files I see:
> 1043 files of 161MB (my sstable size is 160MB)
> 9 large files - 3 between 1 and 2GB, 3 of 5-8GB, and one each of 55GB, 70GB and 370GB
> 263 files of various sizes - between a few dozen KB and 160MB
> I've been running the heavy load for about 1.5 days, it has been close to 3 days since then, and
the number of pending compactions does not go down.
> I have applied one of the not-so-obvious recommendations - disabling multithreaded compaction -
and that seems to be helping a bit: some nodes have started to have fewer pending compactions,
about half of the cluster in fact. But even those nodes sit idle most of the time, lazily
compacting in one stream with CPU at ~140% and occasionally doing bursts of compaction work for a
few minutes.
> I am wondering if this is really a bug, or something in the LCS logic that manifests itself only
in such an edge-case scenario where lots of unique data has been loaded quickly.
> By the way, I see this pattern only for one of the two tables - the one that has about 4 times
more data than the other (space-wise; the number of rows is the same). It looks like all these
pending compactions are really only for that larger table.
> I'll be attaching the relevant logs shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
