Date: Wed, 13 Apr 2016 21:29:27 +0000 (UTC)
From: "Lucas de Souza Santos (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-9666) Provide an alternative to DTCS

    [ https://issues.apache.org/jira/browse/CASSANDRA-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240056#comment-15240056 ]

Lucas de Souza Santos edited comment on CASSANDRA-9666 at 4/13/16 9:28 PM:
---------------------------------------------------------------------------

I work for the biggest online news portal in Brazil, where we are building a time-series system with Cassandra as the persistence layer.

At the end of last year we decided to use DTCS. This time-series implementation is rolling out to production to replace a legacy Cacti installation.

Right now the cluster is composed of 10 Dell PE R6xx servers (some R610s, some R620s), all with SAS disks, 8 CPUs, and 32 GB of RAM, running CentOS 6 with kernel 2.6.32.

Since Jan 20 2016 we have been running cassandra21-2.1.12 on JRE 7. At that point we were just doing some tests, receiving ~140k points/minute. The cluster was fine, using STCS (the default) as the compaction strategy.

At the end of February I changed to DTCS and we doubled the load to around 200k points/minute. A week later we saw CPU load climbing, together with disk space and memory usage. At first we thought our own usage had simply grown, so we built some dashboards to visualize the data.

About 3 weeks ago the cluster started to hit timeouts, and we lost a node at least twice; a reboot was needed to bring the node back.

Things I have done trying to fix/improve the cluster: upgraded JRE 7 to JDK 8, switched to the G1 garbage collector, and lowered memtable_cleanup_threshold to 0.10 (it was 0.20; raising this value made the problem worse). I also changed all applications using Cassandra to consistency level ONE, because GC pauses were pushing nodes out of the cluster and we were getting a lot of timeouts.

After those changes the cluster behaved better, but we were not confident enough to grow the number of requests. Last week I noticed a problem when restarting any node: it took at least 15 minutes, sometimes 30, just to load/open the sstables. I checked the data on disk and saw that Cassandra had created more than 10 million sstables. I couldn't even do a simple "ls" in any data directory (I have 14 keyspaces).
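For context, a minimal sketch of the kind of table this setup implies. The keyspace, table, columns, and option values are hypothetical (the comment does not give the schema), but base_time_seconds and max_sstable_age_days are real DTCS options in 2.1:

    -- Hypothetical time-series table under DTCS; schema and values are
    -- illustrative, not taken from the cluster described above.
    CREATE TABLE metrics.points (
        metric text,
        day    text,        -- manual day bucket, e.g. '2016-04-13'
        ts     timestamp,
        value  double,
        PRIMARY KEY ((metric, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
      AND compaction = {
        'class': 'DateTieredCompactionStrategy',
        'base_time_seconds': '3600',     -- size of the first (smallest) window
        'max_sstable_age_days': '30'     -- sstables older than this stop compacting
      };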
Searching Cassandra issues about DTCS, we found TWCS as an alternative, and we saw that several of the problems we were having had already been reported against DTCS. Afraid of a crash in production, I couldn't even wait for a complete test in QA, so I decided to apply TWCS to our biggest keyspace. The result was impressive: from more than 2.5 million sstables down to around 30 per node (after a full compaction), with no data loss and, at that point, no change in load or memory. Given these results, yesterday (03/12/2016) I decided to apply TWCS to all 14 keyspaces, and today the result, at least for me, is mind-blowing.

Now I have around 500 sstables per node, summed over all keyspaces: from 10 million down to 500! The load5 dropped from ~6 to ~0.5, and Cassandra released around 3 GB of RAM per node. Disk usage dropped from ~150 GB to ~120 GB. Right after that, the request rate went up from 120k to 190k requests per minute, and we are seeing no change in load.
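For reference, the switch itself is a single schema change per table; a minimal sketch, assuming a hypothetical table metrics.points and one-day windows (on the standalone 2.1/2.2 jar the class may need its fully qualified name, e.g. com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy; an in-tree build accepts the short name):

    -- Switch an existing table from DTCS/STCS to TWCS (illustrative values).
    ALTER TABLE metrics.points
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',   -- MINUTES, HOURS, or DAYS
        'compaction_window_size': '1'       -- one window per day
    };

After the change, each existing sstable is assigned to a window by its maximum timestamp, and windows that are no longer current compact down toward a single sstable, which matches the collapse from millions of sstables to a handful per table described above.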
> Provide an alternative to DTCS
> ------------------------------
>
>                 Key: CASSANDRA-9666
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9666
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeff Jirsa
>            Assignee: Jeff Jirsa
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: dashboard-DTCS_to_TWCS.png, dtcs-twcs-io.png, dtcs-twcs-load.png
>
>
> DTCS is great for time series data, but it comes with caveats that make it difficult to use in production (typical operator behaviors such as bootstrap, removenode, and repair have MAJOR caveats as they relate to max_sstable_age_days, and hints/read repair break the selection algorithm).
> I'm proposing an alternative, TimeWindowCompactionStrategy, that sacrifices the tiered nature of DTCS in order to address some of DTCS' operational shortcomings. I believe it is necessary to propose an alternative rather than simply adjusting DTCS, because it fundamentally removes the tiered nature in order to remove the parameter max_sstable_age_days; the result is very, very different, even if it is heavily inspired by DTCS.
> Specifically, rather than creating a number of windows of ever-increasing sizes, this strategy allows an operator to choose the window size, compact with STCS within the first window of that size, and aggressively compact down to a single sstable once that window is no longer current. The window size is a combination of unit (minutes, hours, days) and size (1, etc.), such that an operator can expect all data in a block of that size to be compacted together (that is, if your unit is hours and size is 6, you will create roughly 4 sstables per day, each one containing roughly 6 hours of data; see the sketch after this message).
> The result addresses a number of the problems with DateTieredCompactionStrategy:
> - At the present time, DTCS's first window is compacted using an unusual selection criterion, which prefers files with earlier timestamps but ignores sizes. In TimeWindowCompactionStrategy, the first window's data will be compacted with the well-tested, fast, reliable STCS. All STCS options can be passed to TimeWindowCompactionStrategy to configure the first window's compaction behavior.
> - HintedHandoff may put old data in new sstables, but it will have little impact other than slightly reduced efficiency (sstables will cover a wider range, but the old timestamps will not impact sstable selection criteria during compaction).
> - ReadRepair may put old data in new sstables, but it will have little impact other than slightly reduced efficiency (sstables will cover a wider range, but the old timestamps will not impact sstable selection criteria during compaction).
> - Small, old sstables resulting from streams of any kind will be swiftly and aggressively compacted with the other sstables matching their similar maxTimestamp, without causing sstables in neighboring windows to grow in size.
> - The configuration options are explicit and straightforward; the tuning parameters leave little room for error. The window is set in common, easily understandable terms such as "12 hours", "1 day", "30 days". The minute/hour/day options are granular enough for users keeping data for hours as well as users keeping data for years.
> - There is no explicitly configurable max sstable age, though sstables will naturally stop compacting once new data is written in that window.
> - Streaming operations can create sstables with old timestamps, and they'll naturally be joined together with sstables in the same time bucket. This is true for bootstrap/repair/sstableloader/removenode.
> - It remains true that if old data and new data are written into the memtable at the same time, the resulting sstables will be treated as if they were new sstables; however, that no longer negatively impacts the compaction strategy's selection criteria for older windows.
> Patch provided for:
> - 2.1: https://github.com/jeffjirsa/cassandra/commits/twcs-2.1
> - 2.2: https://github.com/jeffjirsa/cassandra/commits/twcs-2.2
> - trunk (post-8099): https://github.com/jeffjirsa/cassandra/commits/twcs
> Rebased and force-pushed July 18, with bug fixes for estimated pending compactions and potential starvation if more than min_threshold sstables existed in the current window but STCS did not consider them viable candidates.
> Rebased and force-pushed Aug 20 to bring in relevant logic from CASSANDRA-9882.
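As a concrete sketch of the configuration described in the list above (the table name is hypothetical; compaction_window_unit and compaction_window_size are the options the proposal defines, and min_threshold is an ordinary STCS option passed through to the current window):

    -- 6-hour windows: roughly 4 sstables per table per day once windows close.
    ALTER TABLE metrics.points
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',  -- MINUTES, HOURS, or DAYS
        'compaction_window_size': '6',      -- ~4 windows per day
        'min_threshold': '4'                -- STCS option, applied to the
                                            -- current (not-yet-closed) window
    };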