Date: Wed, 13 Apr 2016 21:29:27 +0000 (UTC)
From: "Lucas de Souza Santos (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-9666) Provide an alternative to DTCS

    [ https://issues.apache.org/jira/browse/CASSANDRA-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240056#comment-15240056 ]

Lucas de Souza Santos edited comment on CASSANDRA-9666 at 4/13/16 9:28 PM:
---------------------------------------------------------------------------

I work for the biggest online news portal in Brazil, where we are building a time-series system with Cassandra as the persistence layer.

At the end of last year we decided to use DTCS. This time-series implementation is rolling out to production to replace a legacy Cacti installation.

Right now the cluster is composed of 10 Dell PE R6xx servers (some R610s, some R620s), all with SAS disks, 8 CPUs, and 32 GB of RAM, running CentOS 6 with kernel 2.6.32.

Since Jan 20 2016 we have been running cassandra21-2.1.12 on JRE 7. At that point we were just doing some tests, receiving ~140k points/minute. The cluster was fine, using STCS (the default) as the compaction strategy.

At the end of February I changed to DTCS and we doubled the load to around 200k points/minute. A week later we saw CPU load climbing, together with disk space and memory usage. At first we thought our own usage had simply grown, so we built some dashboards to visualize the data.

About 3 weeks ago the cluster started to hit timeouts, and we lost a node at least twice; a reboot was needed to bring the node back.

Things I have done trying to fix/improve the cluster: upgraded JRE 7 to JDK 8, switched to the G1 garbage collector, and lowered memtable_cleanup_threshold to 0.10 (it was 0.20; raising this value made the problem worse). I also changed all applications using Cassandra to consistency level ONE, because GC pauses were pushing nodes out of the cluster and we were getting a lot of timeouts.

After those changes the cluster behaved better, but we were not confident enough to grow the number of requests. Last week I noticed a problem when restarting any node: it took at least 15 minutes, sometimes 30, just to load/open the sstables. I checked the data on disk and saw that Cassandra had created more than 10 million sstables. I couldn't even do a simple "ls" in any data directory (I have 14 keyspaces).
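For context, a minimal sketch of the kind of table this setup implies. The keyspace, table, columns, and option values are hypothetical (the comment does not give the schema), but base_time_seconds and max_sstable_age_days are real DTCS options in 2.1:

    -- Hypothetical time-series table under DTCS; schema and values are
    -- illustrative, not taken from the cluster described above.
    CREATE TABLE metrics.points (
        metric text,
        day    text,        -- manual day bucket, e.g. '2016-04-13'
        ts     timestamp,
        value  double,
        PRIMARY KEY ((metric, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
      AND compaction = {
        'class': 'DateTieredCompactionStrategy',
        'base_time_seconds': '3600',     -- size of the first (smallest) window
        'max_sstable_age_days': '30'     -- sstables older than this stop compacting
      };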
Searching Cassandra issues about DTCS, we found TWCS as an alternative, and we saw that several of the problems we were having had already been reported against DTCS. Afraid of a crash in production, I couldn't even wait for a complete test in QA, so I decided to apply TWCS to our biggest keyspace. The result was impressive: from more than 2.5 million sstables down to around 30 per node (after a full compaction), with no data loss and, at that point, no change in load or memory. Given these results, yesterday (03/12/2016) I decided to apply TWCS to all 14 keyspaces, and today the result, at least for me, is mind-blowing.

Now I have around 500 sstables per node, summed over all keyspaces: from 10 million down to 500! The load5 dropped from ~6 to ~0.5, and Cassandra released around 3 GB of RAM per node. Disk usage dropped from ~150 GB to ~120 GB. Right after that, the request rate went up from 120k to 190k requests per minute, and we are seeing no change in load.
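For reference, the switch itself is a single schema change per table; a minimal sketch, assuming a hypothetical table metrics.points and one-day windows (on the standalone 2.1/2.2 jar the class may need its fully qualified name, e.g. com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy; an in-tree build accepts the short name):

    -- Switch an existing table from DTCS/STCS to TWCS (illustrative values).
    ALTER TABLE metrics.points
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',   -- MINUTES, HOURS, or DAYS
        'compaction_window_size': '1'       -- one window per day
    };

After the change, each existing sstable is assigned to a window by its maximum timestamp, and windows that are no longer current compact down toward a single sstable, which matches the collapse from millions of sstables to a handful per table described above.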
> Provide an alternative to DTCS
> ------------------------------
>
>                 Key: CASSANDRA-9666
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9666
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeff Jirsa
>            Assignee: Jeff Jirsa
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: dashboard-DTCS_to_TWCS.png, dtcs-twcs-io.png, dtcs-twcs-load.png
>
>
> DTCS is great for time series data, but it comes with caveats that make it difficult to use in production (typical operator behaviors such as bootstrap, removenode, and repair have MAJOR caveats as they relate to max_sstable_age_days, and hints/read repair break the selection algorithm).
> I'm proposing an alternative, TimeWindowCompactionStrategy, that sacrifices the tiered nature of DTCS in order to address some of DTCS' operational shortcomings. I believe it is necessary to propose an alternative rather than simply adjusting DTCS, because it fundamentally removes the tiered nature in order to remove the parameter max_sstable_age_days; the result is very, very different, even if it is heavily inspired by DTCS.
> Specifically, rather than creating a number of windows of ever-increasing sizes, this strategy allows an operator to choose the window size, compact with STCS within the first window of that size, and aggressively compact down to a single sstable once that window is no longer current. The window size is a combination of unit (minutes, hours, days) and size (1, etc.), such that an operator can expect all data in a block of that size to be compacted together (that is, if your unit is hours and size is 6, you will create roughly 4 sstables per day, each one containing roughly 6 hours of data; see the sketch after this message).
> The result addresses a number of the problems with DateTieredCompactionStrategy:
> - At the present time, DTCS's first window is compacted using an unusual selection criterion, which prefers files with earlier timestamps but ignores sizes. In TimeWindowCompactionStrategy, the first window's data will be compacted with the well-tested, fast, reliable STCS. All STCS options can be passed to TimeWindowCompactionStrategy to configure the first window's compaction behavior.
> - HintedHandoff may put old data in new sstables, but it will have little impact other than slightly reduced efficiency (sstables will cover a wider range, but the old timestamps will not impact sstable selection criteria during compaction).
> - ReadRepair may put old data in new sstables, but it will have little impact other than slightly reduced efficiency (sstables will cover a wider range, but the old timestamps will not impact sstable selection criteria during compaction).
> - Small, old sstables resulting from streams of any kind will be swiftly and aggressively compacted with the other sstables matching their similar maxTimestamp, without causing sstables in neighboring windows to grow in size.
> - The configuration options are explicit and straightforward; the tuning parameters leave little room for error. The window is set in common, easily understandable terms such as "12 hours", "1 day", "30 days". The minute/hour/day options are granular enough for users keeping data for hours as well as users keeping data for years.
> - There is no explicitly configurable max sstable age, though sstables will naturally stop compacting once new data is written in that window.
> - Streaming operations can create sstables with old timestamps, and they'll naturally be joined together with sstables in the same time bucket. This is true for bootstrap/repair/sstableloader/removenode.
> - It remains true that if old data and new data are written into the memtable at the same time, the resulting sstables will be treated as if they were new sstables; however, that no longer negatively impacts the compaction strategy's selection criteria for older windows.
> Patch provided for:
> - 2.1: https://github.com/jeffjirsa/cassandra/commits/twcs-2.1
> - 2.2: https://github.com/jeffjirsa/cassandra/commits/twcs-2.2
> - trunk (post-8099): https://github.com/jeffjirsa/cassandra/commits/twcs
> Rebased and force-pushed July 18, with bug fixes for estimated pending compactions and potential starvation if more than min_threshold sstables existed in the current window but STCS did not consider them viable candidates.
> Rebased and force-pushed Aug 20 to bring in relevant logic from CASSANDRA-9882.
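As a concrete sketch of the configuration described in the list above (the table name is hypothetical; compaction_window_unit and compaction_window_size are the options the proposal defines, and min_threshold is an ordinary STCS option passed through to the current window):

    -- 6-hour windows: roughly 4 sstables per table per day once windows close.
    ALTER TABLE metrics.points
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',  -- MINUTES, HOURS, or DAYS
        'compaction_window_size': '6',      -- ~4 windows per day
        'min_threshold': '4'                -- STCS option, applied to the
                                            -- current (not-yet-closed) window
    };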